ROCQUENCOURT 



Towards Practical Typechecking for Macro Tree Transducers 



Abstract: Macro tree transducers (mtt) are an important model that both covers many useful 
XML transformations and allows decidable exact typechecking. This paper reports our first step 
toward an implementation of an mtt typechecker with practical efficiency. Our approach is to 
represent the input type obtained from backward inference as an alternating tree automaton, 
in a style similar to Tozawa's XSLT0 typechecking. In this approach, typechecking reduces to 
checking emptiness of an alternating tree automaton. We propose several optimizations 
(Cartesian factorization, state partitioning) on the backward inference process in order to produce 
much smaller alternating tree automata than the naive algorithm does, and we present an efficient 
algorithm for checking emptiness of alternating tree automata, in which we exploit the explicit 
representation of alternation for local optimizations. Our preliminary experiments confirm that our 
algorithm performs well enough in practice to typecheck simple transformations with respect 
to the full XHTML type in a reasonable time. 

Keywords: tree automata, tree transducers, exact typechecking, alternating automata 
Static typechecking for XML transformations is an important problem that is expected to have a 
significant impact on real-world XML development. To this end, several research groups have 
worked on building typed XML programming languages [?], [3], with much influence from 
the tradition of typed functional languages [2], [10]. While this line of work has successfully 
treated general, Turing-complete languages, its approximative nature means that even a 
transformation as trivial as the identity function fails to typecheck unless a large amount of 
code duplication and type annotations is introduced [?], [7]. This situation has led us to pay 
attention to completely different approaches that have no such deficiency, among which exact 
typechecking has emerged as a promising one. The exact typechecking approach has extensively 
been investigated for years ^2], [20], HQ E3, EH EH El US [H H31 [El HI], in which macro tree 
transducers (mtt) have been one of the most important models since they allow decidable exact 
typechecking [5], yet cover many useful XML transformations [5], [?], [19]. Unfortunately, these 
studies are mainly theoretical, and their practicality has never been clear except for some small 
cases [23], [26]. 

This paper reports our first step toward a practical implementation of a typechecker for mtts. 
As a basis, we follow an already-established scheme called backward inference, which 
computes the preimage of the output type under the subject transformation and then checks it against 
the given input type. We do so because, as is well known, the more obvious forward inference does 
not work: the image of the input type is not always a regular tree language. Our 
proposal is, on top of this scheme, to represent the preimage by an alternating 
tree automaton [21], extending the idea used in Tozawa's typechecking for XSLT0 [23]. In this 
approach, typechecking reduces to checking emptiness of an alternating tree automaton. 

Whereas normal tree automata use only disjunctions in the transition relation, alternating 
tree automata can use both disjunctions and conjunctions. This extra freedom permits a more 
compact representation (they can be exponentially more succinct than normal tree automata) 
and makes them a good intermediate language in which to study optimizations. Having an explicit 
representation of transitions as Boolean formulas allowed us to derive optimized versions of the rules 
for backward inference, such as Cartesian factorization and state partitioning (Section 4.1). 
These optimizations allow our algorithm to scale to large types. We also use Boolean reasoning 
to derive an efficient emptiness algorithm for alternating tree automata (Section 4.2). For 
instance, this algorithm uses the following fact as an efficient shortcut: when considering a formula 
$\phi = \phi_1 \wedge \phi_2$, if $\phi_1$ turns out to denote an empty set, then so does $\phi$, and thus the 
algorithm does not even need to look at $\phi_2$. Note that this fact is immediately available in 
alternating tree automata, while it is not in normal tree automata. 

We have made extensive experiments with our implementation. We have written transformations 
of several sizes and verified them against the full XHTML type, automatically generated from its DTD. 
(In reality, transformations are often small, but the types they work on are quite big in many 
cases; excellent statistical evidence is provided in [?].) The results show that, for transformations 
of this scale, our implementation completes typechecking in a reasonable 
time even with XHTML, which is considered to be quite large. We have also compared the 
performance of our implementation with Tozawa and Hagiya's [26] and confirmed that ours has 
comparable speed on the small examples used in their own experiments. 

On the theoretical side, we have established an exact relationship with two major existing 
algorithms for mtt typechecking: a classical algorithm based on "function enumeration" [1] and 
an algorithm proposed by Maneth, Perst, and Seidl (the MPS algorithm) [12]. Concretely, we have 
proved that (1) the classical algorithm is identical to our algorithm followed by determinization 
of an alternating tree automaton, and that (2) the MPS algorithm is identical to our algorithm 
followed by an emptiness test of an alternating tree automaton. A particular implication is that our 
algorithm inherits a useful property of the MPS algorithm: polynomial-time complexity under 
the restriction of a bounded number of copying [12] (mtt typechecking is in general 
exponential-time complete). Since this paper focuses on the practical side, however, the proofs 
appear in the appendix. 

Related work Numerous techniques for exact typechecking of XML transformations have 
been proposed. Many of these take their target languages from the tree transducer family. 
They include techniques for macro tree transducers [12], [4], for macro forest transducers [20], 
for k-pebble tree transducers [?], for subsets of XSLT [23], [26], for high-level tree transducers 
[24], and for a tree transformation language TL [?]. Other techniques treat XML query languages 
in the select-construct style [?] or even simpler transformations [?]. Most of the above-mentioned 
work provides only theoretical results; the only exceptions are [23], [26], where some 
experimental results are shown, though we have examined much bigger examples (in particular 
in the size of types). 

Several algorithms in pragmatic approaches have been proposed to address high-complexity 
problems related to XML typechecking. A top-down algorithm for the inclusion test on tree 
automata has been developed and used in the XDuce typechecker [9]; an improved version is proposed 
in [22]. A similar idea has been exploited in the work on CDuce for the emptiness check on 
alternating tree automata [6]; the emptiness check algorithm in our present work is strongly 
influenced by this. Tozawa and Hagiya have developed BDD-based algorithms for the inclusion test 
on tree automata [25] and for the satisfiability test on a certain logic related to XML typechecking. 

Overview This paper is organized as follows. In Section 2, we recall the classical definitions 
of macro tree transducers (mtt), bottom-up tree automata (bta), and alternating tree automata 
(ata). In Section 3, we present the two components of our typechecking algorithm: backward 
type inference (which produces an ata from an mtt and a deterministic bta) and the emptiness check 
for alternating tree automata. In Section 4, we revisit these two components from a practical 
point of view and describe important optimizations and implementation techniques. In 
Section 5, we report the results of experiments with our implementation of the typechecker 
on several XML transformations. In Section 6, we conclude with our future directions. 
Appendix A is devoted to a precise comparison between our algorithm and the classical algorithm 
or the Maneth-Perst-Seidl algorithm for typechecking mtts. We show that each of these algorithms 
can be retrieved from ours by composing it with a known algorithm. In Appendix B, we propose 
the notion of bounded-traversing alternating tree automata, which is a natural counterpart of 
the syntactically bounded-copying mtts proposed in [12]. We show in particular that this notion 
ensures that the emptiness check runs in polynomial time. 

2 Preliminaries 

2.1 Macro Tree Transducers 

We assume an alphabet $\Sigma$ where each symbol $a \in \Sigma$ is associated with its arity; often we write 
$a^{(n)}$ to denote a symbol $a$ with arity $n$. We assume that there is a symbol $\epsilon$ with zero arity. 
Trees are defined by 

\[ v ::= a^{(n)}(v_1, \ldots, v_n) \]

We write $\epsilon$ for $\epsilon()$ and $\vec{v} = (v_1, \ldots, v_n)$ to represent a tuple of trees. Assume a set of variables, 
ranged over by $x, y, \ldots$ A macro tree transducer (mtt) $T$ is a tuple $(P, P_0, \Pi)$ where $P$ is a finite 
set of procedures, $P_0 \subseteq P$ is a set of initial procedures, and $\Pi$ is a set of (transformation) rules 
each of the form 

\[ p^{(k)}(a^{(n)}(x_1, \ldots, x_n), y_1, \ldots, y_k) \to e \]

where each $y_j$ is called an (accumulating) parameter and $e$ is an $(n,k)$-expression. We will abbreviate 
the tuples $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_k)$ to $\vec{x}$ and $\vec{y}$. Note that each procedure is associated with 
its arity, i.e., the number of parameters; we write $p^{(k)}$ to denote a procedure $p$ with arity $k$. An 
$(n,k)$-expression $e$ is defined by the following grammar 

\[ e ::= a^{(m)}(e_1, \ldots, e_m) \mid p^{(l)}(x_h, e_1, \ldots, e_l) \mid y_j \]

where only $y_j$ with $1 \le j \le k$ and $x_h$ with $1 \le h \le n$ can appear as variables. We assume that 
each initial procedure has arity zero. 

We describe the semantics of an mtt $(P, P_0, \Pi)$ by a denotation function $[\![\cdot]\!]$. First, the 
semantics of a procedure $p^{(k)}$ takes a tree $a^{(n)}(v_1, \ldots, v_n)$ and parameters $\vec{w} = (w_1, \ldots, w_k)$ and 
returns the set of trees resulting from evaluating any of $p$'s body expressions. 

\[ [\![p^{(k)}]\!](a^{(n)}(\vec{v}), \vec{w}) = \bigcup_{(p^{(k)}(a^{(n)}(\vec{x}), \vec{y}) \to e) \in \Pi} [\![e]\!](\vec{v}, \vec{w}) \]

Then, the semantics of an $(n,k)$-expression $e$ takes a current $n$-tuple $\vec{v} = (v_1, \ldots, v_n)$ of trees 
and a $k$-tuple of parameters $\vec{w} = (w_1, \ldots, w_k)$, and returns the set of trees resulting from the 
evaluation. It is defined as follows. 

\begin{align*}
[\![a^{(m)}(e_1, \ldots, e_m)]\!](\vec{v}, \vec{w}) &= \{ a^{(m)}(v'_1, \ldots, v'_m) \mid v'_i \in [\![e_i]\!](\vec{v}, \vec{w}) \text{ for } i = 1, \ldots, m \} \\
[\![p^{(l)}(x_h, e_1, \ldots, e_l)]\!](\vec{v}, \vec{w}) &= \bigcup \{ [\![p^{(l)}]\!](v_h, (w'_1, \ldots, w'_l)) \mid w'_j \in [\![e_j]\!](\vec{v}, \vec{w}) \text{ for } j = 1, \ldots, l \} \\
[\![y_j]\!](\vec{v}, \vec{w}) &= \{ w_j \}
\end{align*}

A constructor expression $a^{(m)}(e_1, \ldots, e_m)$ evaluates each subexpression and reconstructs a tree 
node with $a$ and the results of these subexpressions. A procedure call $p(x_h, e_1, \ldots, e_l)$ evaluates 
the procedure $p$ under the $h$-th subtree $v_h$, passing the results of $e_1, \ldots, e_l$ as parameters. A 
variable expression $y_j$ simply results in the corresponding parameter's value $w_j$. Note that an mtt 
is allowed to inspect only the input tree and never a part of the output tree being constructed. 
Also, parameters only accumulate subtrees that will potentially become part of the output and 
never point to parts of the input. 

The whole semantics of the mtt with respect to a given input tree $v$ is defined by 
$T(v) = \bigcup_{p_0 \in P_0} [\![p_0]\!](v)$. An mtt $T$ is deterministic when $T(v)$ has at most one element for any $v$; also, 
$T$ is total when $T(v)$ has at least one element for any $v$. We will also use the classical definitions 
of images and preimages: $T(V) = \bigcup_{v \in V} T(v)$ and $T^{-1}(V') = \{ v \mid \exists v' \in V'.\ v' \in T(v) \}$. 
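
As an illustration of the denotation function, the following Python sketch evaluates an mtt on an input tree. It is only a sketch under stated assumptions: the tuple encoding of trees and expressions, the names `RULES`, `INITIAL`, `run_proc`, `run_expr`, and `transduce`, and the two sample rules are choices made for this example and are not part of the formal development.

```python
from itertools import product

# Assumed encoding: a tree is a tuple (label, child_1, ..., child_n).
E = ("e",)  # the zero-arity symbol epsilon

# Assumed encoding of expressions (indices h and j are 1-based, as in the text):
#   ("con", label, [e_1, ..., e_m])   constructor expression
#   ("call", p, h, [e_1, ..., e_l])   procedure call p(x_h, e_1, ..., e_l)
#   ("par", j)                        parameter y_j
# RULES maps (procedure, input label) to the list of right-hand sides.
RULES = {
    ("p0", "a"): [("call", "p", 1, [("con", "e", [])])],   # p0(a(x1)) -> p(x1, e)
    ("p", "e"): [("con", "b", [("par", 1), ("par", 1)])],  # p(e, y1)  -> b(y1, y1)
}
INITIAL = ["p0"]


def run_proc(p, v, ws):
    """[[p]](v, ws): union over every rule of p that matches the root of v."""
    label, children = v[0], v[1:]
    out = set()
    for rhs in RULES.get((p, label), []):
        out |= run_expr(rhs, children, ws)
    return out


def run_expr(e, vs, ws):
    """[[e]](vs, ws): the set of trees the expression may produce."""
    kind = e[0]
    if kind == "con":
        _, label, subs = e
        choices = [run_expr(s, vs, ws) for s in subs]
        return {(label, *combo) for combo in product(*choices)}
    if kind == "call":
        _, p, h, args = e
        choices = [run_expr(a, vs, ws) for a in args]
        out = set()
        for combo in product(*choices):
            out |= run_proc(p, vs[h - 1], combo)
        return out
    _, j = e
    return {ws[j - 1]}


def transduce(v):
    """T(v): union of the results of all initial procedures (arity zero)."""
    out = set()
    for p0 in INITIAL:
        out |= run_proc(p0, v, ())
    return out
```

With these two sample rules, `transduce(("a", E))` yields the single tree `("b", E, E)`, while `transduce(E)` is empty because no rule of `p0` matches the leaf. Recursion terminates because every procedure call descends into a strict subtree of the input.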

2.2 Tree Automata and Alternation 

A (bottom-up) tree automaton (bta) $\mathcal{M}$ is a tuple $(Q, Q_F, \Delta)$ where $Q$ is a finite set of states, 
$Q_F \subseteq Q$ is a set of final states, and $\Delta$ is a set of (transition) rules each of the form 
$q \leftarrow a^{(n)}(q_1, \ldots, q_n)$ where each $q_i$ is from $Q$. We will write $\vec{q}$ for the tuple $(q_1, \ldots, q_n)$. Given a bta 
$\mathcal{M} = (Q, Q_F, \Delta)$, acceptance of a tree by a state is defined inductively as follows: $\mathcal{M}$ accepts 
a tree $a^{(n)}(\vec{v})$ by a state $q$ when there is a rule $q \leftarrow a^{(n)}(\vec{q})$ in $\Delta$ such that each subtree $v_i$ is 
accepted by the corresponding state $q_i$. $\mathcal{M}$ accepts a tree $v$ when $\mathcal{M}$ accepts $v$ by a final state 
$q \in Q_F$. We write $[\![q]\!]_{\mathcal{M}}$ for the set of trees that the automaton $\mathcal{M}$ accepts by the state $q$ (we 
drop the subscript $\mathcal{M}$ when it is clear), and $\mathcal{L}(\mathcal{M}) = \bigcup_{q \in Q_F} [\![q]\!]$ for the set of trees accepted 
by the automaton $\mathcal{M}$. Also, we sometimes say that a value $v$ has type $q$ when $v$ is accepted by 
the state $q$. A bta $(Q, Q_F, \Delta)$ is complete and deterministic when, for any constructor $a^{(n)}$ and 
$n$-tuple of states $\vec{q}$, there is exactly one transition rule of the form $q \leftarrow a^{(n)}(\vec{q})$ in $\Delta$. Such a bta 
is called a deterministic bottom-up tree automaton (dbta). In a dbta, for any value $v$, there is exactly one 
state $q$ such that $v \in [\![q]\!]$. In other words, the collection $\{ [\![q]\!] \mid q \in Q \}$ is a partition of the set of 
trees. 

An alternating tree automaton (ata) $\mathcal{A}$ is a tuple $(\Xi, \Xi_0, \Phi)$ where $\Xi$ is a finite set of states, 
$\Xi_0 \subseteq \Xi$ is a set of initial states, and $\Phi$ is a function that maps each pair $(X, a^{(n)})$ of a state and 
an $n$-ary constructor to an $n$-formula, where $n$-formulas are defined by the following grammar. 

\[ \phi ::= \downarrow_i X \mid \phi_1 \vee \phi_2 \mid \phi_1 \wedge \phi_2 \mid \top \mid \bot \]

(with $1 \le i \le n$). In particular, note that a 0-ary formula evaluates naturally to a Boolean. 
Given an ata $\mathcal{A} = (\Xi, \Xi_0, \Phi)$, we define acceptance of a tree by a state. $\mathcal{A}$ accepts a tree $a^{(n)}(\vec{v})$ 
by a state $X$ when $\vec{v} \vdash \Phi(X, a^{(n)})$ holds, where the judgment $\vec{v} \vdash \phi$ is defined inductively as 
follows: 

• $\vec{v} \vdash \phi_1 \wedge \phi_2$ if $\vec{v} \vdash \phi_1$ and $\vec{v} \vdash \phi_2$. 

• $\vec{v} \vdash \phi_1 \vee \phi_2$ if $\vec{v} \vdash \phi_1$ or $\vec{v} \vdash \phi_2$. 

• $\vec{v} \vdash \top$. 

• $\vec{v} \vdash \downarrow_i X$ if $\mathcal{A}$ accepts $v_i$ by $X$. 

That is, $\vec{v} \vdash \phi$ intuitively means that $\phi$ holds by interpreting each $\downarrow_i X$ as "$v_i$ has type $X$." We 
write $[\![X]\!]$ for the set of trees accepted by a state $X$ and $[\![\phi]\!] = \{ \vec{v} \mid \vec{v} \vdash \phi \}$ for the set of $n$-tuples 
accepted by an $n$-formula $\phi$. We write $\mathcal{L}(\mathcal{A}) = \bigcup_{X_0 \in \Xi_0} [\![X_0]\!]$ for the language accepted by the 
ata $\mathcal{A}$. Note that a bta $\mathcal{M} = (Q, Q_F, \Delta)$ can be seen as an ata with the same set of states 
and final states by defining the function $\Phi$ as $\Phi(q, a^{(n)}) = \bigvee_{(q \leftarrow a^{(n)}(\vec{q})) \in \Delta} \bigwedge_{i=1,\ldots,n} \downarrow_i q_i$, and the 
definitions for the semantics of states and the language accepted by the automaton seen as a 
bta or an ata then coincide. We will use the notation $\simeq$ to represent semantic equivalence of 
pairs of states or pairs of formulas. 
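
The acceptance judgment can be read directly as a recursive procedure. The Python sketch below is one such reading, under assumptions made for this example only: formulas are encoded as nested tuples, the transition function is a dictionary named `TRANS`, and a missing entry is treated as $\bot$. The sample automaton encodes a small bta as an ata and adds one state, here called `both`, that uses a conjunction on the same child.

```python
TOP, BOT = ("top",), ("bot",)
E = ("e",)  # trees are tuples (label, child_1, ..., child_n)


def holds(phi, vs, trans):
    """The judgment vs |- phi, for formulas encoded as nested tuples."""
    kind = phi[0]
    if kind == "top":
        return True
    if kind == "bot":
        return False
    if kind == "and":
        # Python's `and` stops at the first failing conjunct.
        return holds(phi[1], vs, trans) and holds(phi[2], vs, trans)
    if kind == "or":
        return holds(phi[1], vs, trans) or holds(phi[2], vs, trans)
    _, i, state = phi  # ("down", i, X): the i-th child must have type X
    return accepts(state, vs[i - 1], trans)


def accepts(state, v, trans):
    """The ata accepts v by `state`; a missing transition is read as bottom."""
    return holds(trans.get((state, v[0]), BOT), v[1:], trans)


TRANS = {
    # The bta rules q0 <- b(q1, q2), q1 <- e, q2 <- e seen as an ata.
    ("q0", "b"): ("and", ("down", 1, "q1"), ("down", 2, "q2")),
    ("q1", "e"): TOP,
    ("q2", "e"): TOP,
    # Alternation proper: the same child must have both types q1 and q2.
    ("both", "a"): ("and", ("down", 1, "q1"), ("down", 1, "q2")),
}
```

For instance, `accepts("q0", ("b", E, E), TRANS)` holds, and `accepts("both", ("a", E), TRANS)` holds because the leaf has both types `q1` and `q2`.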

3 Typechecking 

3.1 Backward inference 

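Given a dbta $\mathcal{M}_{\mathrm{out}}$ ("output type"), a bta $\mathcal{M}_{\mathrm{in}}$ ("input type"), and an mtt $T$, the goal of 
typechecking is to verify that $T(\mathcal{L}(\mathcal{M}_{\mathrm{in}})) \subseteq \mathcal{L}(\mathcal{M}_{\mathrm{out}})$. It is well known that $T(\mathcal{L}(\mathcal{M}_{\mathrm{in}}))$ is 
in general beyond regular tree languages, and hence the forward inference approach (i.e., first 
calculate an automaton representing $T(\mathcal{L}(\mathcal{M}_{\mathrm{in}}))$ and then check that it is included in $\mathcal{L}(\mathcal{M}_{\mathrm{out}})$) does 
not work. Therefore the approach usually taken is backward inference, which is based on 
the observation that $T(\mathcal{L}(\mathcal{M}_{\mathrm{in}})) \subseteq \mathcal{L}(\mathcal{M}_{\mathrm{out}}) \iff \mathcal{L}(\mathcal{M}_{\mathrm{in}}) \cap T^{-1}(\mathcal{L}(\mathcal{M})) = \emptyset$, where $\mathcal{M}$ is 
the complement automaton of $\mathcal{M}_{\mathrm{out}}$. If the intersection $\mathcal{L}(\mathcal{M}_{\mathrm{in}}) \cap T^{-1}(\mathcal{L}(\mathcal{M}))$ is 
not empty, then it is possible to exhibit a tree $v$ in this intersection. Since this tree satisfies 
$v \in \mathcal{L}(\mathcal{M}_{\mathrm{in}})$ and $T(v) \not\subseteq \mathcal{L}(\mathcal{M}_{\mathrm{out}})$, it is a counter-example to the 
well-typedness of the mtt with respect to the given input and output types. Algorithmically, the 
approach consists of computing an automaton $\mathcal{A}$ representing $T^{-1}(\mathcal{L}(\mathcal{M}))$ and then checking 
that $\mathcal{L}(\mathcal{M}_{\mathrm{in}}) \cap \mathcal{L}(\mathcal{A}) = \emptyset$. Since the language $T^{-1}(\mathcal{L}(\mathcal{M}))$ is regular and such an automaton 
$\mathcal{A}$ can effectively be computed, the above disjointness is decidable. 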

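The originality of our approach is to compute $\mathcal{A}$ as an alternating tree automaton. Let a 
dbta $\mathcal{M} = (Q, Q_F, \Delta)$ and an mtt $T = (P, P_0, \Pi)$ be given. Here, note that the automaton $\mathcal{M}$, 
which denotes the complement of the output type $\mathcal{M}_{\mathrm{out}}$, can be obtained from $\mathcal{M}_{\mathrm{out}}$ in linear 
time since $\mathcal{M}_{\mathrm{out}}$ is deterministic. From $\mathcal{M}$ and $T$, we build an ata $\mathcal{A} = (\Xi, \Xi_0, \Phi)$ where 

\begin{align*}
\Xi &= \{ \langle p^{(k)}, q, \vec{q} \rangle \mid p^{(k)} \in P,\ q \in Q,\ \vec{q} \in Q^k \} \\
\Xi_0 &= \{ \langle p_0, q \rangle \mid p_0 \in P_0,\ q \in Q_F \} \\
\Phi(\langle p^{(k)}, q, \vec{q} \rangle, a^{(n)}) &= \bigvee_{(p^{(k)}(a^{(n)}(\vec{x}), \vec{y}) \to e) \in \Pi} \mathrm{Inf}(e, q, \vec{q}).
\end{align*}

Here, the function Inf is defined inductively as follows. 

\begin{align*}
\mathrm{Inf}(b^{(m)}(e_1, \ldots, e_m), q, \vec{q}) &= \bigvee_{(q \leftarrow b^{(m)}(\vec{q}')) \in \Delta}\ \bigwedge_{j=1,\ldots,m} \mathrm{Inf}(e_j, q'_j, \vec{q}) \\
\mathrm{Inf}(p^{(l)}(x_h, e_1, \ldots, e_l), q, \vec{q}) &= \bigvee_{\vec{q}' \in Q^l} \Bigl( \downarrow_h \langle p^{(l)}, q, \vec{q}' \rangle \wedge \bigwedge_{j=1,\ldots,l} \mathrm{Inf}(e_j, q'_j, \vec{q}) \Bigr) \\
\mathrm{Inf}(y_j, q, \vec{q}) &= \begin{cases} \top & (q = q_j) \\ \bot & (q \neq q_j) \end{cases}
\end{align*}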
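
The three defining equations of Inf translate almost literally into code. The following Python sketch is one possible transcription, under assumptions made for this example: expressions and formulas are encoded as tuples, `delta` is a list of triples `(q, label, (q_1, ..., q_n))`, and the helpers `conj` and `disj` apply the obvious Boolean simplifications (dropping $\top$ from conjunctions and $\bot$ from disjunctions), which the definition in the text does not perform. The sample dbta with states `qe` ("the tree is the leaf") and `qo` ("any other tree") is hypothetical.

```python
from itertools import product

TOP, BOT = ("top",), ("bot",)


def conj(parts):
    """n-ary conjunction with trivial simplification (an assumed convenience)."""
    parts = [f for f in parts if f != TOP]
    if BOT in parts:
        return BOT
    if not parts:
        return TOP
    return parts[0] if len(parts) == 1 else ("and", tuple(parts))


def disj(parts):
    """n-ary disjunction with trivial simplification (an assumed convenience)."""
    parts = [f for f in parts if f != BOT]
    if TOP in parts:
        return TOP
    if not parts:
        return BOT
    return parts[0] if len(parts) == 1 else ("or", tuple(parts))


def inf(e, q, qs, states, delta):
    """Inf(e, q, qs): formula for 'e may produce a tree of type q'."""
    kind = e[0]
    if kind == "con":       # b(e_1, ..., e_m)
        _, b, subs = e
        return disj(
            conj(inf(s, qj, qs, states, delta) for s, qj in zip(subs, qp))
            for (target, label, qp) in delta
            if target == q and label == b
        )
    if kind == "call":      # p(x_h, e_1, ..., e_l)
        _, p, h, args = e
        return disj(
            conj([("down", h, (p, q, qp))]
                 + [inf(a, qj, qs, states, delta) for a, qj in zip(args, qp)])
            for qp in product(states, repeat=len(args))
        )
    _, j = e                # y_j
    return TOP if qs[j - 1] == q else BOT


# Hypothetical complete deterministic bta over e (arity 0), a (1), b (2).
STATES = ["qe", "qo"]
DELTA = (
    [("qe", "e", ())]
    + [("qo", "a", (q,)) for q in STATES]
    + [("qo", "b", (q, r)) for q in STATES for r in STATES]
)
```

For the body `b(y1, y1)` with result type `qo` and parameter type `(qe,)`, `inf` returns $\top$: the rule `qo <- b(qe, qe)` is matched by both occurrences of the parameter. For the call `p(x1, e)` with result type `qo`, only the parameter type `qe` survives, so the result is the single atom `("down", 1, ("p", "qo", ("qe",)))`.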



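Let us explain why this algorithm works. Since a precise discussion is critical for 
understanding subsequent sections, we summarize our justification here as a formal proof. 

Theorem 1 $\mathcal{L}(\mathcal{A}) = T^{-1}(\mathcal{L}(\mathcal{M}))$. 

Proof: Intuitively, each state $\langle p, q, \vec{q} \rangle$ represents the set of trees $v$ such that the procedure $p$ 
may transform $v$ to some tree $u$ of type $q$, assuming that the parameters $y_i$ are bound to trees 
$w_i$ each of type $q_i$. Formally, we prove the following invariant 

\[ \forall v.\ \forall \vec{w} \in [\![\vec{q}]\!].\quad v \in [\![\langle p^{(k)}, q, \vec{q} \rangle]\!] \iff [\![p^{(k)}]\!](v, \vec{w}) \cap [\![q]\!] \neq \emptyset \tag{1} \]

where $\vec{w} \in [\![\vec{q}]\!]$ means $w_1 \in [\![q_1]\!], \ldots, w_k \in [\![q_k]\!]$. Note that this invariant implies that the 
right-hand side does not depend on the specific choice of the values $w_i$ from the sets $[\![q_i]\!]$; this point 
will be crucial later. From this invariant, the initial states $\Xi_0$ represent the set of trees that we 
want and hence the result follows: 

\begin{align*}
\mathcal{L}(\mathcal{A}) &= \bigcup \{ [\![\langle p_0, q \rangle]\!] \mid p_0 \in P_0,\ q \in Q_F \} \\
&= \{ v \mid [\![p_0]\!](v) \cap [\![q]\!] \neq \emptyset,\ p_0 \in P_0,\ q \in Q_F \} \\
&= \{ v \mid T(v) \cap \mathcal{L}(\mathcal{M}) \neq \emptyset \}
\end{align*}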

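The proof of the invariant proceeds by induction on the structure of $v$. For the proof, we 
first need to consider an invariant that holds for the function Inf. Informally, $\mathrm{Inf}(e, q, \vec{q})$ infers 
an $n$-formula representing the set of $n$-tuples $\vec{v}$ such that the expression $e$ may transform $\vec{v}$ to 
some tree of type $q$, assuming that the parameters $y_i$ are bound to trees $w_i$ each of type $q_i$. 
Formally, we prove the following: 

\[ \forall \vec{v}.\ \forall \vec{w} \in [\![\vec{q}]\!].\quad \vec{v} \in [\![\mathrm{Inf}(e, q, \vec{q})]\!] \iff [\![e]\!](\vec{v}, \vec{w}) \cap [\![q]\!] \neq \emptyset \tag{2} \]

Indeed, this implies the invariant (1). Let $v = a^{(n)}(\vec{v})$; for all $\vec{w} \in [\![\vec{q}]\!]$: 

\begin{align*}
v \in [\![\langle p^{(k)}, q, \vec{q} \rangle]\!] &\iff \vec{v} \in [\![\Phi(\langle p^{(k)}, q, \vec{q} \rangle, a^{(n)})]\!] \\
&\iff \exists (p^{(k)}(a^{(n)}(\vec{x}), \vec{y}) \to e) \in \Pi.\ \vec{v} \in [\![\mathrm{Inf}(e, q, \vec{q})]\!] \\
&\iff \exists (p^{(k)}(a^{(n)}(\vec{x}), \vec{y}) \to e) \in \Pi.\ [\![e]\!](\vec{v}, \vec{w}) \cap [\![q]\!] \neq \emptyset \quad \text{(by (2))} \\
&\iff [\![p^{(k)}]\!](v, \vec{w}) \cap [\![q]\!] \neq \emptyset
\end{align*}

The invariant (2) is in turn proved by induction on the structure of $e$. 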

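Case $e = b^{(m)}(e_1, \ldots, e_m)$. In order for a tree $u$ of type $q$ to be produced from the constructor 
expression, first, there must be a transition $q \leftarrow b^{(m)}(\vec{q}') \in \Delta$. In addition, each subtree of $u$ 
must have type $q'_j$ and must be produced from the corresponding subexpression $e_j$. For 
the latter condition, we can use the induction hypothesis for (2). Formally, for all $\vec{w} \in [\![\vec{q}]\!]$: 

\begin{align*}
\vec{v} \in [\![\mathrm{Inf}(e, q, \vec{q})]\!] &\iff \vec{v} \in \Bigl[\!\!\Bigl[ \bigvee_{(q \leftarrow b^{(m)}(\vec{q}')) \in \Delta}\ \bigwedge_{j=1,\ldots,m} \mathrm{Inf}(e_j, q'_j, \vec{q}) \Bigr]\!\!\Bigr] \\
&\iff \exists (q \leftarrow b^{(m)}(\vec{q}')) \in \Delta.\ \forall j = 1, \ldots, m.\ \vec{v} \in [\![\mathrm{Inf}(e_j, q'_j, \vec{q})]\!] \\
&\iff \exists (q \leftarrow b^{(m)}(\vec{q}')) \in \Delta.\ \forall j = 1, \ldots, m.\ [\![e_j]\!](\vec{v}, \vec{w}) \cap [\![q'_j]\!] \neq \emptyset \quad \text{(by I.H. for (2))} \\
&\iff [\![e]\!](\vec{v}, \vec{w}) \cap [\![q]\!] \neq \emptyset
\end{align*}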

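Case $e = p^{(l)}(x_h, e_1, \ldots, e_l)$. In order for a tree $u$ of type $q$ to be produced from the procedure 
call, first, a tree $w'_j$ of some type $q'_j$ must be yielded from each parameter expression 
$e_j$. In addition, the $h$-th input tree must have type $\langle p^{(l)}, q, (q'_1, \ldots, q'_l) \rangle$ since the result 
tree $u$ must be produced by the procedure $p^{(l)}$ from the $h$-th input tree with parameters 
$w'_1, \ldots, w'_l$ of types $q'_1, \ldots, q'_l$. We can use the induction hypothesis for (2) for the former 
condition and that for (1) for the latter condition. Formally, for all $\vec{w} \in [\![\vec{q}]\!]$: 

\begin{align*}
\vec{v} \in [\![\mathrm{Inf}(e, q, \vec{q})]\!] &\iff \vec{v} \in \Bigl[\!\!\Bigl[ \bigvee_{\vec{q}' \in Q^l} \Bigl( \downarrow_h \langle p, q, \vec{q}' \rangle \wedge \bigwedge_{j=1,\ldots,l} \mathrm{Inf}(e_j, q'_j, \vec{q}) \Bigr) \Bigr]\!\!\Bigr] \\
&\iff \exists \vec{q}' \in Q^l.\ v_h \in [\![\langle p, q, \vec{q}' \rangle]\!] \wedge \forall j = 1, \ldots, l.\ \vec{v} \in [\![\mathrm{Inf}(e_j, q'_j, \vec{q})]\!] \\
&\iff \exists \vec{q}' \in Q^l.\ v_h \in [\![\langle p, q, \vec{q}' \rangle]\!] \wedge \forall j = 1, \ldots, l.\ [\![e_j]\!](\vec{v}, \vec{w}) \cap [\![q'_j]\!] \neq \emptyset \quad \text{(by I.H. for (2))} \\
&\iff \exists \vec{q}' \in Q^l.\ v_h \in [\![\langle p, q, \vec{q}' \rangle]\!] \wedge \exists \vec{w}'.\ \forall j = 1, \ldots, l.\ w'_j \in [\![e_j]\!](\vec{v}, \vec{w}) \wedge w'_j \in [\![q'_j]\!] \tag{3}
\end{align*}

We can show that the last condition holds iff 

\[ \exists \vec{w}'.\ [\![p^{(l)}]\!](v_h, \vec{w}') \cap [\![q]\!] \neq \emptyset \wedge \forall j = 1, \ldots, l.\ w'_j \in [\![e_j]\!](\vec{v}, \vec{w}) \tag{4} \]

that is, iff $[\![e]\!](\vec{v}, \vec{w}) \cap [\![q]\!] \neq \emptyset$. Indeed, for the "only if" direction, we 
apply the induction hypothesis for (1) where we instantiate $\vec{w}$ with the specific $\vec{w}'$ in (3) 
(this is exactly the place that uses the fact that the quantification on $\vec{w}$ appears outside 
the "$\iff$" in (1)) and obtain the following: 

\begin{align*}
\exists \vec{q}' \in Q^l.\ \exists \vec{w}'.\ & [\![p^{(l)}]\!](v_h, \vec{w}') \cap [\![q]\!] \neq \emptyset \\
& \wedge\ \forall j = 1, \ldots, l.\ w'_j \in [\![e_j]\!](\vec{v}, \vec{w}) \wedge w'_j \in [\![q'_j]\!] \tag{5}
\end{align*}

By dropping the condition $w'_j \in [\![q'_j]\!]$ (and the unused quantification on $\vec{q}'$), we obtain (4). 
For the "if" direction, since the automaton $\mathcal{M}$ is complete, i.e., there is a 
state $q$ for any value $w$ such that $w \in [\![q]\!]$, we obtain (5) from (4). Then, the induction 
hypothesis for (1) yields (3). 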

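Case $e = y_j$. In order for a tree of type $q$ to be produced from the variable expression, $y_j$ must 
have type $q$. Formally, first note that $\vec{v} \in [\![\mathrm{Inf}(e, q, \vec{q})]\!] \iff q = q_j$, for any $\vec{v}$. Note also 
that, since $\mathcal{M}$ is deterministic bottom-up, all the states are pairwise disjoint: $[\![q]\!] \cap [\![q']\!] = \emptyset$ 
whenever $q \neq q'$. Therefore, for all $\vec{w} \in [\![\vec{q}]\!]$: 

\begin{align*}
\vec{v} \in [\![\mathrm{Inf}(e, q, \vec{q})]\!] &\iff q = q_j \\
&\iff w_j \in [\![q]\!] \\
&\iff [\![e]\!](\vec{v}, \vec{w}) \cap [\![q]\!] \neq \emptyset
\end{align*}

□ 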

In the proof above, the case for variable expressions critically uses the determinism constraint. 
Indeed, the statement of the theorem does not necessarily hold if $\mathcal{M}$ is nondeterministic. For 
example, consider the nondeterministic bta $\mathcal{M}$ with the transition rules 

\[ q_0 \leftarrow b(q_1, q_2) \qquad q_1 \leftarrow \epsilon \qquad q_2 \leftarrow \epsilon \]

($q_0$ is the final state) and typecheck the mtt $T$ with the transformation rules 

\begin{align*}
p_0(a(x_1)) &\to p(x_1, \epsilon) \\
p(\epsilon, y_1) &\to b(y_1, y_1)
\end{align*}

($p_0$ is the initial procedure) with respect to the result type $q_0$. With this mtt, the input value $a(\epsilon)$ 
translates to $b(\epsilon, \epsilon)$, which is accepted by $\mathcal{M}$. However, our algorithm will infer an input type that 
denotes the empty set, which is incorrect. To see this more closely, consider inference on the body 
of $p$ with the result type $q = q_0$ and the parameter type $\vec{q} = (q_1)$. The condition (2) does not hold 
since the only choice of $\vec{w} \in [\![\vec{q}]\!]$ is $\vec{w} = (\epsilon)$ and, in this case, the right-hand side holds whereas the 
left-hand side does not since $\mathrm{Inf}(b(y_1, y_1), q_0, (q_1)) = \mathrm{Inf}(y_1, q_1, (q_1)) \wedge \mathrm{Inf}(y_1, q_2, (q_1)) = \top \wedge \bot = \bot$. 
The same argument applies to the parameter type $\vec{q} = (q_2)$. Now, in inference on the 
body of $p_0$ with the result type $q_0$, the call to $p$ must have parameter type $q_1$ or $q_2$ since only 
these can accept $\epsilon$. From the previous inference, we conclude that the input type inferred for 
the call is again the empty set type; so is the whole input type. 

However, the variable case is the only one that uses determinism. Therefore, if the mtt uses 
no parameters, i.e., is a simple top-down tree transducer, then the same algorithm works for 
a non-deterministic output type.¹ Moreover, if the mtt $T$ is deterministic and total, we have 
$T^{-1}(\overline{\mathcal{L}(\mathcal{M}_{\mathrm{out}})}) = \overline{T^{-1}(\mathcal{L}(\mathcal{M}_{\mathrm{out}}))}$. Hence, it suffices to check 
$\mathcal{L}(\mathcal{M}_{\mathrm{in}}) \subseteq T^{-1}(\mathcal{L}(\mathcal{M}_{\mathrm{out}}))$ instead of 
$\mathcal{L}(\mathcal{M}_{\mathrm{in}}) \cap T^{-1}(\overline{\mathcal{L}(\mathcal{M}_{\mathrm{out}})}) = \emptyset$. This could be advantageous since a direct conversion from an 
XML schema yields a non-deterministic automaton, and determinizing it may blow up 
(though this step is known to take only a reasonable time in practice), whereas inclusion can 
be tested more efficiently by using known clever algorithms that avoid a full materialization 
of a deterministic automaton [9], [22], [25]. Tozawa presents in his work [23] a backward 
inference algorithm based on alternating tree automata for deterministic forest transducers with no 
parameters, where he exploits the above observation to obtain a simple algorithm. 

¹ Completeness of the output type is not needed for our algorithm to work on top-down tree transducers. 
This is because the only place where we use completeness in the proof is the case for procedure calls, in which 
completeness is actually not necessary if there is no parameter. 

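Finally, it remains to check $\mathcal{L}(\mathcal{M}_{\mathrm{in}}) \cap \mathcal{L}(\mathcal{A}) = \emptyset$, for which we first calculate an ata $\mathcal{A}'$ 
representing $\mathcal{L}(\mathcal{M}_{\mathrm{in}}) \cap \mathcal{L}(\mathcal{A})$ (this can easily be done since an ata can freely use intersections) 
and then check the emptiness of $\mathcal{A}'$. The next section explains how to do this. The size of the 
ata $\mathcal{A}$ is polynomial in the sizes of $\mathcal{M}_{\mathrm{out}}$ and of $T$. The size of $\mathcal{A}'$ is thus polynomial in the sizes 
of $\mathcal{M}_{\mathrm{in}}$, $\mathcal{M}_{\mathrm{out}}$, and $T$. 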



3.2 Emptiness check 

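Let $\mathcal{A} = (\Xi, \Xi_0, \Phi)$ be an alternating tree automaton. We want to decide whether the set $\mathcal{L}(\mathcal{A})$ 
is empty or not. We first define the following system of implications $\rho$, where we introduce 
propositional variables $\bar{X}$ consisting of all subsets of $\Xi$: 

\[ \rho = \Bigl\{ \bar{X} \Leftarrow \bar{X}_1 \wedge \ldots \wedge \bar{X}_n \Bigm| \exists a^{(n)}.\ (\bar{X}_1, \ldots, \bar{X}_n) \in \mathrm{DNF}\Bigl( \bigwedge_{X \in \bar{X}} \Phi(X, a^{(n)}) \Bigr) \Bigr\} \]

Here, $\mathrm{DNF}(\phi)$ computes $\phi$'s disjunctive normal form by pushing intersections under unions and 
regrouping atoms of the form $\downarrow_i X$ for a fixed $i$; the result is formatted as a set of $n$-tuples of 
state sets. More precisely: 

\begin{align*}
\mathrm{DNF}(\top) &= \{ (\emptyset, \ldots, \emptyset) \} \\
\mathrm{DNF}(\bot) &= \emptyset \\
\mathrm{DNF}(\phi_1 \wedge \phi_2) &= \{ (\bar{X}_1 \cup \bar{Y}_1, \ldots, \bar{X}_n \cup \bar{Y}_n) \mid (\bar{X}_1, \ldots, \bar{X}_n) \in \mathrm{DNF}(\phi_1),\ (\bar{Y}_1, \ldots, \bar{Y}_n) \in \mathrm{DNF}(\phi_2) \} \\
\mathrm{DNF}(\phi_1 \vee \phi_2) &= \mathrm{DNF}(\phi_1) \cup \mathrm{DNF}(\phi_2) \\
\mathrm{DNF}(\downarrow_h X) &= \{ (\emptyset, \ldots, \emptyset, \{X\}, \emptyset, \ldots, \emptyset) \} \quad \text{(the $h$-th element is $\{X\}$)}
\end{align*}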
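
The five DNF equations can be transcribed one by one. The Python sketch below does so under an assumed encoding: formulas are nested tuples with binary `"and"` and `"or"` nodes, state sets are `frozenset`s, and the arity `n` is passed explicitly; the function name `dnf` is chosen for this example.

```python
def dnf(phi, n):
    """DNF(phi) for an n-formula: a set of n-tuples of frozensets of states."""
    kind = phi[0]
    if kind == "top":
        return {tuple(frozenset() for _ in range(n))}
    if kind == "bot":
        return set()
    if kind == "and":
        # Combine every disjunct of the left side with every one of the right.
        return {
            tuple(x | y for x, y in zip(xs, ys))
            for xs in dnf(phi[1], n)
            for ys in dnf(phi[2], n)
        }
    if kind == "or":
        return dnf(phi[1], n) | dnf(phi[2], n)
    _, h, state = phi  # ("down", h, X): only the h-th component is non-empty
    return {
        tuple(frozenset([state]) if i == h else frozenset() for i in range(1, n + 1))
    }
```

For the 2-formula $(\downarrow_1 A \vee \downarrow_2 B) \wedge \downarrow_1 C$, this returns the two tuples $(\{A, C\}, \emptyset)$ and $(\{C\}, \{B\})$: each tuple lists, per child, the states that child must satisfy simultaneously.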

Then, with the system of implications above, we verify that $\rho \vdash \{X\}$ for some $X \in \Xi_0$. The 
judgment $\rho \vdash \bar{X}$ here is defined such that it holds when it can be derived by the single rule: if 
$\rho$ contains $\bar{X} \Leftarrow \bar{X}_1 \wedge \ldots \wedge \bar{X}_n$ and $\rho \vdash \bar{X}_i$ for all $i = 1, \ldots, n$, then $\rho \vdash \bar{X}$. 

Each propositional variable $\bar{X}$ intuitively denotes that the intersection of the sets denoted 
by all the states in $\bar{X}$ is non-empty: $\bigcap_{X \in \bar{X}} [\![X]\!] \neq \emptyset$. Thus, we can prove the following. 



Proposition 1 $\mathcal{L}(\mathcal{A}) \neq \emptyset$ iff $\rho \vdash \{X\}$ for some $X \in \Xi_0$. 



Proof: The result follows by showing that $v \in \bigcap_{X \in \bar{X}} [\![X]\!]$ for some $v$ iff $\rho \vdash \bar{X}$. The "only if" 
direction can be proved by induction on the structure of $v$. The "if" direction can be proved by 
induction on the derivation of $\rho \vdash \bar{X}$. □ 

This emptiness check can be implemented in linear time with respect to the size of $\rho$, which 
itself is exponential in the size of $\mathcal{A}$. 
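
Once the system $\rho$ is given explicitly, deciding $\rho \vdash \bar{X}$ is the classical problem of forward chaining over propositional Horn clauses. The Python sketch below uses the standard counter-based worklist technique, which visits each premise occurrence a constant number of times; it is an illustration and not necessarily the procedure used in our implementation. The name `derivable`, the clause encoding `(head, premises)`, and the four sample clauses are assumptions of this example.

```python
from collections import defaultdict, deque


def derivable(clauses):
    """Return the set of variables X with rho |- X.

    `clauses` is a list of (head, premises) pairs, one per implication
    head <= premise_1 and ... and premise_n; variables must be hashable.
    """
    remaining = []                 # per clause: premises not yet proved
    watchers = defaultdict(list)   # variable -> clauses waiting for it
    proved = set()
    queue = deque()
    for idx, (head, premises) in enumerate(clauses):
        distinct = set(premises)
        remaining.append(len(distinct))
        for x in distinct:
            watchers[x].append(idx)
        if not distinct and head not in proved:
            proved.add(head)
            queue.append(head)
    while queue:
        x = queue.popleft()
        for idx in watchers[x]:
            remaining[idx] -= 1
            if remaining[idx] == 0:
                head = clauses[idx][0]
                if head not in proved:
                    proved.add(head)
                    queue.append(head)
    return proved


# Hypothetical system: variables are singleton sets of states.
X, Y, Z, W = (frozenset([s]) for s in "XYZW")
CLAUSES = [
    (X, (Y, Z)),  # from Phi(X, b) = down_1 Y and down_2 Z
    (Y, ()),      # from Phi(Y, e) = top
    (Z, ()),      # from Phi(Z, e) = top
    (W, (W,)),    # from Phi(W, a) = down_1 W: no finite tree is accepted
]
```

Here `derivable(CLAUSES)` contains `X`, `Y`, and `Z` but not `W`: a state whose only transition refers back to itself has no finite derivation, which matches the fact that it accepts no finite tree.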
As we explained above, our algorithm splits the typechecking process into two phases: first, we 
compute an alternating tree automaton from the output type and the mtt; second, we check 
emptiness of this tree automaton. In this section, we describe some details and 
optimizations of these two phases. 



4.1 Backward inference 

A simple algorithm to compute the input type as an alternating tree automaton is to follow 
naively the formal construction given in Section 3.1. A first observation is that it is possible to 
build the automaton lazily, starting from the initial states, producing new states and computing 
$\Phi(\_)$ only on demand. This is sometimes useful since the emptiness check algorithm we are 
going to describe in the next section works in a top-down way and will not always materialize 
the whole automaton. 

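The defining equations for the function Inf as given in Section 3.1 produce huge formulas. 
We will now describe new equations that produce much smaller formulas in practice. Before 
describing them, it is convenient to generalize the notation $\mathrm{Inf}(e, q, \vec{q})$ by allowing a set of states 
$\bar{q} \subseteq Q$ instead of a single state $q \in Q$ for the output type. Intuitively, we want $\mathrm{Inf}(e, \bar{q}, \vec{q})$ to 
be semantically equivalent to $\bigvee_{q \in \bar{q}} \mathrm{Inf}(e, q, \vec{q})$. We obtain a direct definition of $\mathrm{Inf}(e, \bar{q}, \vec{q})$ by 
adapting the rules for $\mathrm{Inf}(e, q, \vec{q})$: 

\begin{align*}
\mathrm{Inf}(b^{(m)}(e_1, \ldots, e_m), \bar{q}, \vec{q}) &= \bigvee_{(q \leftarrow b^{(m)}(\vec{q}')) \in \Delta,\ q \in \bar{q}}\ \bigwedge_{j=1,\ldots,m} \mathrm{Inf}(e_j, \{q'_j\}, \vec{q}) \\
\mathrm{Inf}(p^{(l)}(x_h, e_1, \ldots, e_l), \bar{q}, \vec{q}) &= \bigvee_{\vec{q}' \in Q^l} \Bigl( \downarrow_h \langle p^{(l)}, \bar{q}, \vec{q}' \rangle \wedge \bigwedge_{j=1,\ldots,l} \mathrm{Inf}(e_j, \{q'_j\}, \vec{q}) \Bigr) \\
\mathrm{Inf}(y_j, \bar{q}, \vec{q}) &= \begin{cases} \top & (q_j \in \bar{q}) \\ \bot & (q_j \notin \bar{q}) \end{cases}
\end{align*}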

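We have used the notation $\downarrow_h \langle p^{(l)}, \bar{q}, \vec{q}' \rangle$. Intuitively, this should be semantically equivalent to 
the union $\bigvee_{q \in \bar{q}} \downarrow_h \langle p^{(l)}, q, \vec{q}' \rangle$. Instead of using this as a definition, we prefer to change the set 
of states of the automaton: 

\begin{align*}
\Xi &= \{ \langle p^{(k)}, \bar{q}, q_1, \ldots, q_k \rangle \mid p^{(k)} \in P,\ \bar{q} \subseteq Q,\ \vec{q} \in Q^k \} \\
\Xi_0 &= \{ \langle p_0, Q_F \rangle \mid p_0 \in P_0 \} \\
\Phi(\langle p^{(k)}, \bar{q}, \vec{q} \rangle, a^{(n)}) &= \bigvee_{(p^{(k)}(a^{(n)}(\vec{x}), \vec{y}) \to e) \in \Pi} \mathrm{Inf}(e, \bar{q}, \vec{q}).
\end{align*}

In theory, this new alternating tree automaton could have exponentially many more states. 
However, in practice, and because of the optimizations we will describe now, this actually reduces 
significantly the number of states that need to be computed. 

The sections below will use the semantic equivalence $\bigvee_{q \in \bar{q}} \mathrm{Inf}(e, \{q\}, \vec{q}) \simeq \mathrm{Inf}(e, \bar{q}, \vec{q})$ 
mentioned above in order to simplify formulas. 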

4.1.1 Cartesian factorization 

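The rule for the constructor expression $b^{(m)}(e_1, \ldots, e_m)$ can be written: 

\[ \mathrm{Inf}(b^{(m)}(e_1, \ldots, e_m), \bar{q}, \vec{q}) = \bigvee_{\vec{q}' \in \Delta(\bar{q}, b^{(m)})}\ \bigwedge_{j=1,\ldots,m} \mathrm{Inf}(e_j, \{q'_j\}, \vec{q}) \]

where $\Delta(\bar{q}, b^{(m)}) = \{ \vec{q}' \mid q \leftarrow b^{(m)}(\vec{q}') \in \Delta,\ q \in \bar{q} \} \subseteq Q^m$. Now assume that we have a 
decomposition of this set $\Delta(\bar{q}, b^{(m)})$ as a union of $l$ Cartesian products: 

\[ \Delta(\bar{q}, b^{(m)}) = (\bar{q}^1_1 \times \ldots \times \bar{q}^1_m) \cup \ldots \cup (\bar{q}^l_1 \times \ldots \times \bar{q}^l_m) \]

where the $\bar{q}^i_j$ are sets of states. It is always possible to find such a decomposition: at worst, 
using only singletons for the $\bar{q}^i_j$, we will have as many terms in the union as $m$-tuples in 
$\Delta(\bar{q}, b^{(m)})$. But often, we can produce a decomposition with fewer terms in the union. Let 
us write $\mathrm{Cart}(\Delta(\bar{q}, b^{(m)}))$ for such a decomposition (seen as a subset of $(2^Q)^m$). One can then 
use the following rule: 

\[ \mathrm{Inf}(b^{(m)}(e_1, \ldots, e_m), \bar{q}, \vec{q}) = \bigvee_{(\bar{q}_1, \ldots, \bar{q}_m) \in \mathrm{Cart}(\Delta(\bar{q}, b^{(m)}))}\ \bigwedge_{j=1,\ldots,m} \mathrm{Inf}(e_j, \bar{q}_j, \vec{q}) \]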
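
The text above leaves open how a decomposition Cart is obtained. One simple possibility, shown here purely as an assumption and not as the method of our implementation, is a greedy merge: start from the worst-case decomposition into singletons and repeatedly merge two products that differ in exactly one component, which preserves the union exactly. The names `cart_decompose` and `expand` are hypothetical, and the result is not guaranteed to be minimal.

```python
from itertools import product


def cart_decompose(tuples):
    """Greedily cover a set of m-tuples by a union of Cartesian products.

    Returns a set of m-tuples of frozensets whose expansion is exactly `tuples`.
    """
    # Worst case from the text: one product of singletons per tuple.
    products = {tuple(frozenset([q]) for q in t) for t in tuples}
    changed = True
    while changed:
        changed = False
        items = list(products)
        for a in items:
            for b in items:
                diff = [i for i in range(len(a)) if a[i] != b[i]]
                if len(diff) == 1:
                    # A x S x B  union  A x S' x B  =  A x (S | S') x B
                    i = diff[0]
                    products -= {a, b}
                    products.add(a[:i] + (a[i] | b[i],) + a[i + 1:])
                    changed = True
                    break
            if changed:
                break
    return products


def expand(products):
    """The set of plain tuples denoted by a decomposition."""
    return {t for prod in products for t in product(*prod)}
```

On the full square $\{q_1, q_2\} \times \{q_1, q_2\}$ the four singleton products collapse into a single product, so the constructor rule produces one disjunct instead of four. On the diagonal $\{(q_1, q_1), (q_2, q_2)\}$ no merge is possible and two products remain.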

4.1.2 State partitioning 

Intuition The rule for procedure calls enumerates all the possible states for the values of the 
parameters of the called procedure. In its current form, this rule always produces a big union with 
$|Q|^l$ terms. However, it may be the case that we do not need fully precise information about the 
value of a parameter to do the backward type inference. 

Let us illustrate this with a simple example. Assume that the called procedure $p^{(1)}$ has a 
single parameter $y_1$ and that it never does anything with $y_1$ other than copying it (that is, any rule 
for $p$ whose right-hand side mentions $y_1$ is of the form $p^{(1)}(a^{(n)}(x_1, \ldots, x_n), y_1) \to y_1$). Clearly, 
all the states $\langle p, \bar{q}, q'_1 \rangle$ with $q'_1 \in \bar{q}$ are equivalent, and similarly for all the states $\langle p, \bar{q}, q''_1 \rangle$ with 
$q''_1 \notin \bar{q}$. This is because whether the result of the procedure call is in $\bar{q}$ depends only 
on the input tree (because there might be other rules whose right-hand sides do not involve $y_1$ 
at all) and on whether the value of the parameter is itself in $\bar{q}$ or not. In particular, we do not 
need to know exactly in which state the accumulator is. So the rule for calling this procedure 
could just be: 

\begin{align*}
\mathrm{Inf}(p(x_h, e_1), \bar{q}, \vec{q})
&= \bigvee_{q'_1 \in Q} \downarrow_h \langle p, \bar{q}, q'_1 \rangle \wedge \mathrm{Inf}(e_1, \{q'_1\}, \vec{q}) \\
&= \Bigl( \bigvee_{q'_1 \in \bar{q}} \downarrow_h \langle p, \bar{q}, q'_1 \rangle \wedge \mathrm{Inf}(e_1, \{q'_1\}, \vec{q}) \Bigr) \vee \Bigl( \bigvee_{q''_1 \in Q \setminus \bar{q}} \downarrow_h \langle p, \bar{q}, q''_1 \rangle \wedge \mathrm{Inf}(e_1, \{q''_1\}, \vec{q}) \Bigr) \\
&= \bigl( \downarrow_h \langle p, \bar{q}, q'_1 \rangle \wedge \mathrm{Inf}(e_1, \bar{q}, \vec{q}) \bigr) \vee \bigl( \downarrow_h \langle p, \bar{q}, q''_1 \rangle \wedge \mathrm{Inf}(e_1, Q \setminus \bar{q}, \vec{q}) \bigr)
\end{align*}

where in the last line $q'_1$ (resp. $q''_1$) is chosen arbitrarily in $\bar{q}$ (resp. $Q \setminus \bar{q}$). 

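A new rule More generally, in the rule for a call to a procedure $p^{(l)}$, we don't need to consider all the $l$-tuples $\vec q'$, but only a subset of them that captures all the possible situations. First, we assume that for a given procedure and output type $\bar q$, one can compute for each $j = 1,\dots,l$ an equivalence relation $E(p^{(l)}, \bar q, j)$ such that:

\[ \big(\forall j = 1,\dots,l.\ (q'_j, q''_j) \in E(p^{(l)}, \bar q, j)\big) \Rightarrow \langle p^{(l)}, \bar q, \vec q'\rangle \simeq \langle p^{(l)}, \bar q, \vec q''\rangle \qquad (*) \]

Recall the rule for the procedure call:

\[ \mathrm{Inf}(p^{(l)}(x_h, e_1,\dots,e_l), \bar q, \vec q) = \bigvee_{\vec q' \in Q^l} \Big(\downarrow_h \langle p^{(l)}, \bar q, \vec q'\rangle \wedge \bigwedge_{j=1,\dots,l} \mathrm{Inf}(e_j, \{q'_j\}, \vec q)\Big) \]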

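Let us split this union according to the equivalence class of the $q'_j$ modulo the relations $E(p^{(l)}, \bar q, j)$. If for each $j$, we choose an equivalence class $\bar q_j$ for the relation $E(p^{(l)}, \bar q, j)$ (we write $\bar q_j \lhd E(p^{(l)}, \bar q, j)$), then all the states $\langle p^{(l)}, \bar q, \vec q'\rangle$ with $\vec q' \in \bar q_1 \times \dots \times \bar q_l$ are equivalent to $\langle p^{(l)}, \bar q, C(\bar q_1 \times \dots \times \bar q_l)\rangle$, where $C$ is a choice function (it picks an arbitrary element from its argument). We can thus rewrite the right-hand side to:

\[ \bigvee_{\bar q_1 \lhd E(p^{(l)}, \bar q, 1),\,\dots,\,\bar q_l \lhd E(p^{(l)}, \bar q, l)} \Big(\downarrow_h \langle p^{(l)}, \bar q, C(\bar q_1 \times \dots \times \bar q_l)\rangle \wedge \bigvee_{\vec q' \in \bar q_1 \times \dots \times \bar q_l} \; \bigwedge_{j=1,\dots,l} \mathrm{Inf}(e_j, \{q'_j\}, \vec q)\Big) \]

The union of all the formulas $\bigwedge_{j=1,\dots,l} \mathrm{Inf}(e_j, \{q'_j\}, \vec q)$ for $\vec q' \in \bar q_1 \times \dots \times \bar q_l$ is equivalent to $\bigwedge_{j=1,\dots,l} \mathrm{Inf}(e_j, \bar q_j, \vec q)$. Consequently, we obtain the following new rule:

\[ \mathrm{Inf}(p^{(l)}(x_h, e_1,\dots,e_l), \bar q, \vec q) = \bigvee_{\bar q_1 \lhd E(p^{(l)}, \bar q, 1),\,\dots,\,\bar q_l \lhd E(p^{(l)}, \bar q, l)} \Big(\downarrow_h \langle p^{(l)}, \bar q, C(\bar q_1 \times \dots \times \bar q_l)\rangle \wedge \bigwedge_{j=1,\dots,l} \mathrm{Inf}(e_j, \bar q_j, \vec q)\Big) \]

In the worst case, all the equivalence relations $E(p^{(l)}, \bar q, j)$ are the identity, and the right-hand side is the same as for the old rule. But if we can identify larger equivalence classes, we can significantly reduce the number of terms in the union on the right-hand side.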
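To make the saving concrete, the following Python sketch enumerates the terms of the new rule. It is only an illustration: `call_terms` is a hypothetical name, each entry of `partitions` stands in for the classes of one relation $E(p^{(l)}, \bar q, j)$, and the choice function $C$ is approximated by picking the minimum of each class component-wise, which is one valid choice among many.

```python
from itertools import product


def call_terms(partitions, choose=min):
    """Yield one (representative tuple, tuple of classes) per term of the new rule.

    partitions[j] lists the equivalence classes chosen for parameter j;
    `choose` plays the role of the choice function C, applied per component.
    """
    for classes in product(*partitions):
        representative = tuple(choose(c) for c in classes)
        yield representative, classes


# One parameter, five states, output type {0, 1}: two classes instead of five states.
partition = [frozenset({0, 1}), frozenset({2, 3, 4})]
terms = list(call_terms([partition]))
```

With two parameters sharing this partition, the union has 4 terms where the old rule would enumerate 25 tuples.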



Computing the equivalence relations Now we will give an algorithm to compute the relations $E(p^{(k)}, \bar q, j)$ satisfying the condition (*). We will also define equivalence relations $E[e, \bar q, j]$ for any $(n,k)$-expression $e$ (with $j = 1,\dots,k$), such that:

\[ \big(\forall j = 1,\dots,k.\ (q'_j, q''_j) \in E[e, \bar q, j]\big) \Rightarrow \mathrm{Inf}(e, \bar q, \vec q') \simeq \mathrm{Inf}(e, \bar q, \vec q'') \]

We can use the rules used to define the formulas $\mathrm{Inf}(e, \bar q, \vec q)$ in order to obtain sufficient conditions to be satisfied so that these properties hold. We will express these conditions by a system of equations. Before giving this system, we need to introduce some notations. If $E_1$ and $E_2$ are two equivalence relations on $Q$, we write $E_1 \sqsubseteq E_2$ if $E_2 \subseteq E_1$ (when equivalence relations are seen as subsets of $Q^2$). The smallest equivalence relation for this ordering is the equivalence relation with a single equivalence class. The largest equivalence relation is the identity on $Q$. For two equivalence relations $E_1, E_2$, we can define their least upper bound $E_1 \sqcup E_2$ as the set-theoretic intersection. For an equivalence relation $E$ and a set of states $\bar q$, we write $\bar q \lhd E$ if $\bar q$ is one of the equivalence classes modulo $E$. Abusing the notation by identifying an equivalence relation with the partition it induces on $Q$, we will write $\{Q\}$ for the smallest relation and $\{\bar q, Q\setminus\bar q\}$ for the relation with the two equivalence classes $\bar q$ and its complement. The system of equations is derived from the rules used to define the function Inf:

\[
\begin{aligned}
E[b^{(m)}(e_1,\dots,e_m), \bar q, i] &\sqsupseteq \bigsqcup \{E[e_j, \bar q_j, i] \mid (\bar q_1,\dots,\bar q_m) \in \mathrm{Cart}(\Delta(\bar q, b^{(m)})),\ j = 1..m\} \\
E[p^{(l)}(x_h, e_1,\dots,e_l), \bar q, i] &\sqsupseteq \bigsqcup \{E[e_j, \bar q_j, i] \mid \bar q_j \lhd E(p^{(l)}, \bar q, j),\ j = 1..l\} \\
E[y_j, \bar q, i] &\sqsupseteq \begin{cases} \{\bar q, Q\setminus\bar q\} & (i = j) \\ \{Q\} & (i \neq j) \end{cases} \\
E(p^{(k)}, \bar q, j) &\sqsupseteq \bigsqcup \{E[e, \bar q, j] \mid (p^{(k)}(a^{(n)}(\vec x), \vec y) \to e) \in \Pi\}
\end{aligned}
\]

Let us explain why these conditions imply the required properties for the equivalence relation and how they are derived from the rules defining Inf. We will use an intuitive induction argument (on expressions), even though a formal proof actually requires an induction on trees. Consider the rule for the procedure call. The new rule we have obtained above implies that in order to have $\mathrm{Inf}(p^{(l)}(x_h, e_1,\dots,e_l), \bar q, \vec q') \simeq \mathrm{Inf}(p^{(l)}(x_h, e_1,\dots,e_l), \bar q, \vec q'')$, it is sufficient to have $\mathrm{Inf}(e_j, \bar q_j, \vec q') \simeq \mathrm{Inf}(e_j, \bar q_j, \vec q'')$ for all $j = 1,\dots,l$ and for all $\bar q_j \lhd E(p^{(l)}, \bar q, j)$, and thus, by induction, it is also sufficient to have $(q'_i, q''_i) \in E[e_j, \bar q_j, i]$ for all $i$, for all $j = 1,\dots,l$ and for all $\bar q_j \lhd E(p^{(l)}, \bar q, j)$. In other words, a sufficient condition is $(q'_i, q''_i) \in \bigcap \{E[e_j, \bar q_j, i] \mid \bar q_j \lhd E(p^{(l)}, \bar q, j),\ j = 1..l\}$, from which we obtain the equation above (we recall that $\sqcup$ corresponds to set-theoretic intersection of relations). The reasoning is similar for the constructor expression. Indeed, the rule we have obtained in the previous section tells us that in order to have $\mathrm{Inf}(b^{(m)}(e_1,\dots,e_m), \bar q, \vec q') \simeq \mathrm{Inf}(b^{(m)}(e_1,\dots,e_m), \bar q, \vec q'')$, it is sufficient to have $\mathrm{Inf}(e_j, \bar q_j, \vec q') \simeq \mathrm{Inf}(e_j, \bar q_j, \vec q'')$ for all $(\bar q_1,\dots,\bar q_m) \in \mathrm{Cart}(\Delta(\bar q, b^{(m)}))$ and $j = 1,\dots,m$.

As we explained before, it is desirable to compute equivalence relations with large equivalence classes (that is, small for the $\sqsubseteq$ ordering). Here is how we can compute a family of equivalence relations satisfying the system of equations above. First, we consider the CPO of functions mapping a triple $(e, \bar q, i)$ to an equivalence relation on $Q$ and we reformulate the system of equations as finding an element $x$ of this CPO such that $f(x) \sqsubseteq x$, where $f$ is obtained from the right-hand sides of the equations. To compute such an element, we start from $x_0$, the smallest element of the CPO, and we consider the sequence defined by $x_{n+1} = x_n \sqcup f(x_n)$. Since this sequence is monotonic and the CPO is finite, the sequence reaches a constant value after a finite number of iterations. This value $x$ satisfies $f(x) \sqsubseteq x$ as expected. We conjecture that this element is actually a smallest fixpoint for $f$, but we have no proof of this fact (note that the function $f$ is not monotonic).
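The iteration can be sketched in Python as follows. This is a toy illustration under stated assumptions: equivalence relations are represented as partitions (frozensets of frozensets), the unknowns are indexed by plain string keys rather than triples, and the function `f` in the example is invented for the demonstration; only the scheme $x_{n+1} = x_n \sqcup f(x_n)$ comes from the text.

```python
def join(p1, p2):
    """Least upper bound for the ordering of the text.

    Intersecting two equivalence relations amounts to taking the
    coarsest common refinement of the two partitions.
    """
    return frozenset(a & b for a in p1 for b in p2 if a & b)


def split(states, subset):
    """Partition {subset, states minus subset}, dropping an empty block if any."""
    blocks = (frozenset(subset), frozenset(states) - frozenset(subset))
    return frozenset(b for b in blocks if b)


def solve(keys, f, bottom):
    """Iterate x_{n+1} = x_n join f(x_n) from the smallest element until stable."""
    x = {k: bottom for k in keys}
    while True:
        fx = f(x)
        nxt = {k: join(x[k], fx[k]) for k in keys}
        if nxt == x:
            return x
        x = nxt


Q = {0, 1, 2, 3}
bottom = frozenset({frozenset(Q)})  # a single equivalence class


def f(x):
    # Hypothetical right-hand sides: "a" depends on "b", "b" is a constant.
    return {"a": join(split(Q, {0, 1}), x["b"]), "b": split(Q, {0, 2})}


solution = solve(["a", "b"], f, bottom)
```

The result for key `"a"` is the identity relation (four singleton classes), while `"b"` keeps two classes; in both cases `join(solution[k], f(solution)[k]) == solution[k]`, which is the condition $f(x) \sqsubseteq x$.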



4.1.3 Sharing the computation 

Given the rules defining the formulas $\mathrm{Inf}(e, \bar q, \vec q)$, we might end up computing the same formula several times. A very classical optimization consists in memoizing the results of such computations. This is made even more effective by hash-consing the expressions. Indeed, in practice, for a given mtt procedure, many constructors have identical expressions.
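A minimal Python sketch of the two techniques is given below, assuming expressions are built through a single interning table. The names `Interner` and `memoized_inf` are hypothetical, and `compute` is a placeholder for the actual inference rules, which are not reproduced here.

```python
class Interner:
    """Hash-consing table: structurally equal expressions share one object."""

    def __init__(self):
        self._table = {}

    def node(self, label, *children):
        # Children are already interned (and kept alive by the table),
        # so their identities can be used as a structural key.
        key = (label, tuple(id(c) for c in children))
        found = self._table.get(key)
        if found is None:
            found = (label, children)
            self._table[key] = found
        return found


def memoized_inf(compute):
    """Wrap `compute(expr, qbar, qvec)` with a cache keyed on the interned expression."""
    cache = {}

    def inf(expr, qbar, qvec):
        key = (id(expr), qbar, qvec)
        if key not in cache:
            cache[key] = compute(expr, qbar, qvec)
        return cache[key]

    return inf


hc = Interner()
e1 = hc.node("b", hc.node("y1"), hc.node("y1"))
e2 = hc.node("b", hc.node("y1"), hc.node("y1"))  # same object as e1

calls = []
inf = memoized_inf(lambda e, qbar, qvec: calls.append(e) or len(calls))
first = inf(e1, frozenset({1}), (0,))
second = inf(e2, frozenset({1}), (0,))  # served from the cache
```

Because identical right-hand sides are interned to the same object, the cache is hit across different constructors of the same procedure.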



4.1.4 Complementing the output 

In the example at the beginning of the previous subsection, we have displayed a formula where both $\mathrm{Inf}(e, \bar q, \vec q)$ and $\mathrm{Inf}(e, Q\setminus\bar q, \vec q)$ appear. One may wonder what is the relation between these two sub-formulas. Let us recall the required properties for these two formulas:

\[ [\![\mathrm{Inf}(e, \bar q, \vec q)]\!] = \{\vec v \mid [\![e]\!](\vec v, \vec w) \cap [\![\bar q]\!] \neq \emptyset\} \]

\[ [\![\mathrm{Inf}(e, Q\setminus\bar q, \vec q)]\!] = \{\vec v \mid [\![e]\!](\vec v, \vec w) \cap [\![Q\setminus\bar q]\!] \neq \emptyset\} \]

If $e$ denotes a total and deterministic function (that is, if $[\![e]\!](\vec v, \vec w)$ is always a singleton), then $[\![\mathrm{Inf}(e, Q\setminus\bar q, \vec q)]\!]$ is the complement of $[\![\mathrm{Inf}(e, \bar q, \vec q)]\!]$. If we extend the syntax of formulas in alternating tree automata with negation (whose semantics is trivial to define), we can thus introduce the following rule:

\[ \mathrm{Inf}(e, \bar q, \vec q) = \neg\,\mathrm{Inf}(e, Q\setminus\bar q, \vec q) \]

to be applied e.g. when the cardinal of $\bar q$ is strictly larger than half the cardinal of $Q$. In practice, we observed a huge impact of this optimization: the number of constructed states is divided by two in all our experiments, and the emptiness algorithm runs much more efficiently. Also, because of the memoization technique mentioned above, this optimization allows us to share more computation. That said, we don't have a clear explanation for the very important impact of this optimization.

The rule above can only be applied when the expression $e$ denotes a total and deterministic function. We use a very simple syntactic criterion to ensure that: we require all the reachable procedures to have exactly one rule $p^{(k)}(a^{(n)}(x_1,\dots,x_n), y_1,\dots,y_k) \to e$ for each symbol $a^{(n)}$.

4.2 Emptiness algorithm 

In this section, we describe an efficient algorithm to check emptiness of an alternating tree automaton. Instead of directly giving the final version of the algorithm, which would look quite obscure, we prefer to start by formally describing a simple algorithm and then explain various optimizations.

Let $\mathcal{A} = (\Xi, \Xi_0, \Phi)$ be an ata as defined earlier. Negation (as introduced in Section 4.1.4) will be considered later when describing optimizations. The basic algorithm relies on a powerset construction to translate $\mathcal{A}$ into a bottom-up tree automaton $\mathcal{M} = (Q, Q_F, \Delta)$. We define $Q$ as the powerset $2^\Xi$. Intuitively, a state $\bar X = \{X_1,\dots,X_m\}$ in $Q$ represents the intersection of the ata states $X_i$. For such a state and a tag $a^{(n)}$, one must thus consider the formula $\varphi(\bar X, a^{(n)}) = \bigwedge_{i=1,\dots,m} \Phi(X_i, a^{(n)})$, and put in $\Delta$ transitions of the form $\bar X \leftarrow a^{(n)}(\bar X_1,\dots,\bar X_n)$ to mimic the formula $\varphi(\bar X, a^{(n)})$. First, we put $\varphi(\bar X, a^{(n)})$ in disjunctive normal form, using the DNF function introduced earlier:

\[ \varphi(\bar X, a^{(n)}) \simeq \bigvee_{(\bar X_1,\dots,\bar X_n) \in \mathrm{DNF}(\varphi(\bar X, a^{(n)}))} \; \bigwedge_{i=1,\dots,n} \; \bigwedge_{X \in \bar X_i} \downarrow_i X \]

The transition relation $\Delta$ consists of all the transitions $\bar X \leftarrow a^{(n)}(\bar X_1,\dots,\bar X_n)$ such that $(\bar X_1,\dots,\bar X_n) \in \mathrm{DNF}(\varphi(\bar X, a^{(n)}))$. One defines $Q_F = \{\{X\} \mid X \in \Xi_0\}$. One can easily establish that $[\![\bar X]\!]_{\mathcal{M}} = \bigcap_{X \in \bar X} [\![X]\!]_{\mathcal{A}}$ and thus that $\mathcal{L}(\mathcal{M}) = \mathcal{L}(\mathcal{A})$.
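For concreteness, a materialized DNF computation might look like the Python sketch below. The tuple encoding of formulas (`("top",)`, `("bot",)`, `("and", f, g)`, `("or", f, g)`, `("down", i, X)` with a 0-based child index `i`) is an assumption made for this illustration, not the representation used in the implementation.

```python
def dnf(phi, n):
    """Return the disjunctive normal form of `phi` for a symbol of arity `n`.

    The result is a list of n-tuples of frozensets: component i of a tuple
    collects the ata states required of the i-th child.
    """
    kind = phi[0]
    if kind == "top":
        return [tuple(frozenset() for _ in range(n))]
    if kind == "bot":
        return []
    if kind == "down":
        _, i, state = phi
        return [tuple(frozenset([state]) if k == i else frozenset()
                      for k in range(n))]
    left, right = dnf(phi[1], n), dnf(phi[2], n)
    if kind == "or":
        return left + right
    # Conjunction: combine every pair of terms component-wise.
    return [tuple(a | b for a, b in zip(l, r)) for l in left for r in right]


# (down_0 A or down_1 B) and down_0 C, for a binary symbol
phi = ("and", ("or", ("down", 0, "A"), ("down", 1, "B")), ("down", 0, "C"))
terms = dnf(phi, 2)
```

Each tuple in `terms` corresponds to one transition of the powerset automaton. The conjunction case is where the exponential blow-up can occur, which motivates the lazy enumeration discussed later.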

It is well known that deciding emptiness of a bottom-up tree automaton can be done in linear time. The classical algorithm to do so works in a bottom-up way and thus requires fully materializing the automaton (which is of exponential size compared to the original ata). However, the construction above produces the automaton in a top-down way: for a given state $\bar X$, the construction gives all the transitions of the form $\bar X \leftarrow \dots$. We can exploit this fact to derive an algorithm that doesn't necessarily require the whole automaton $\mathcal{M}$ to be built. The algorithm is given below in pseudo-code. The function empty takes a state $\bar X$ and returns true if it is empty or false otherwise. The test is done under a number of assertions represented by two global variables P, N which store sets of $\mathcal{M}$-states. The set stored in P (resp. N) represents positive (resp. negative) emptiness assumptions: states which are assumed to be empty (resp. non-empty). When the state $\bar X$ under consideration is neither in P nor in N, it is first assumed to be empty (added to P). This assumption is then checked recursively by exploring all the incoming transitions (for all possible tags and all components of the disjunctive normal form corresponding to this tag) and if a contradiction is found, the set of positive assumptions is backtracked and $\bar X$ is added to the set of negative assumptions. This memoization-based scheme is standard for coinductive algorithms.

function empty(X)
  if X ∈ P then return true
  if X ∈ N then return false
  let P_saved = P in
  P <- P ∪ {X};
  foreach a(n) ∈ Σ
    if not (empty_formula(φ(X, a(n)))) then
      P <- P_saved
      N <- N ∪ {X}
      return false
  return true

function empty_formula(φ)
  foreach (X1, ..., Xn) ∈ DNF(φ)
    if not (empty_sub(X1, ..., Xn)) then
      return false
  return true

function empty_sub(X1, ..., Xn)
  foreach 1 ≤ i ≤ n
    if (empty Xi) then
      return true
  return false

This algorithm is not linear in the size of the automaton $\mathcal{M}$ because of the backtracking on P. This backtracking can be avoided (as described in [6], Chapter 7 or in [22]), but the technique is rather involved and would make the presentation of the optimizations quite obscure. Moreover, we have implemented the non-backtracking version (with all the optimizations) but we did not observe any noticeable speedup in our tests.

A first optimization improves the effectiveness of the memoization sets P and N. It is based on the fact that if $\bar X_1 \subseteq \bar X_2$ then $[\![\bar X_2]\!] \subseteq [\![\bar X_1]\!]$. As a consequence, if $\bar X' \subseteq \bar X$ for some $\bar X' \in P$, then empty($\bar X$) can immediately return true. Similarly, if $\bar X \subseteq \bar X'$ for some $\bar X' \in N$, then empty($\bar X$) can immediately return false.

Enumeration and pruning of the disjunctive normal form The disjunctive normal form of a formula can be exponentially larger than the formula itself. Our first improvement consists in not materializing it but enumerating it lazily with a pruning technique that avoids the exponential behavior in many cases.

function empty_formula(φ)
  return (empty_dnf([φ], (∅, ..., ∅)))

function empty_dnf(l, ((X1, ..., Xn) as a)) =
  match l with
  | [] -> return false
  | ⊤ :: rest -> return (empty_dnf(rest, a))
  | ⊥ :: rest -> return true
  | φ1 ∨ φ2 :: rest ->
      if not (empty_dnf(φ1 :: rest, a)) then return false
      return (empty_dnf(φ2 :: rest, a))
  | φ1 ∧ φ2 :: rest ->
      return (empty_dnf(φ1 :: φ2 :: rest, a))
  | ↓h X :: rest ->
      if empty(Xh ∪ {X}) then return true
      return (empty_dnf(rest, (X1, ..., Xh ∪ {X}, ..., Xn)))

The first argument of empty_dnf is a list of formulas whose conjunction must be put in disjunctive normal form. The second argument is an $n$-tuple (where $n$ is the arity of the current symbol) which accumulates a "prefix" of the current term of the disjunctive normal form being built. When an atomic formula $\downarrow_h X$ is found, the state $X$ is added to the $h$-th component of the accumulator. Here we have included an important optimization: if the new state $\bar X_h \cup \{X\}$ denotes an empty set, then one can prune the enumeration. For instance, for a formula of the form $\downarrow_1 X \wedge \varphi$ where $X$ turns out to be empty, the enumeration will not even look at $\varphi$. This optimization enforces the invariant that no component of the accumulator denotes an empty set. As a consequence, when the function empty_dnf reaches an empty list of formulas, the accumulator represents an element of the disjunctive normal form for which empty_sub would return false.

The order in which we consider the two sub-formulas $\varphi_1$ and $\varphi_2$ in the formulas $\varphi_1 \wedge \varphi_2$ and $\varphi_1 \vee \varphi_2$ might have a big impact on performance. It might be worthwhile to look for heuristics guiding this choice.
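The backtracking emptiness check with lazy, pruned DNF enumeration can be sketched in Python as follows. This is an illustrative transcription of the pseudo-code, not the authors' OCaml code: the class name `LazyEmptiness`, the tuple encoding of formulas (with 0-based child indices) and the two-state example automaton are all assumptions made for the sketch, and negation is not handled.

```python
class LazyEmptiness:
    """Emptiness of sets of ata states, with assumption sets P and N."""

    def __init__(self, alphabet, trans):
        self.alphabet = alphabet  # symbol -> arity
        self.trans = trans        # (state, symbol) -> formula
        self.P = set()            # states assumed empty
        self.N = set()            # states known to be non-empty

    def empty(self, X):
        if X in self.P:
            return True
        if X in self.N:
            return False
        saved = set(self.P)
        self.P.add(X)
        for a, n in self.alphabet.items():
            todo = [self.trans[(x, a)] for x in X]  # implicit conjunction
            acc = tuple(frozenset() for _ in range(n))
            if not self.empty_dnf(todo, acc):
                self.P = saved  # backtrack the positive assumptions
                self.N.add(X)
                return False
        return True

    def empty_dnf(self, todo, acc):
        if not todo:
            return False  # a complete DNF term with no empty component
        phi, rest = todo[0], todo[1:]
        kind = phi[0]
        if kind == "top":
            return self.empty_dnf(rest, acc)
        if kind == "bot":
            return True
        if kind == "or":
            if not self.empty_dnf([phi[1]] + rest, acc):
                return False
            return self.empty_dnf([phi[2]] + rest, acc)
        if kind == "and":
            return self.empty_dnf([phi[1], phi[2]] + rest, acc)
        _, h, state = phi  # atomic formula down_h state
        grown = acc[h] | {state}
        if self.empty(grown):
            return True  # prune: this DNF term is empty
        return self.empty_dnf(rest, acc[:h] + (grown,) + acc[h + 1:])


# Hypothetical ata: "any" accepts every tree, "none" accepts no finite tree.
alphabet = {"leaf": 0, "node": 2}
trans = {
    ("any", "leaf"): ("top",),
    ("any", "node"): ("and", ("down", 0, "any"), ("down", 1, "any")),
    ("none", "leaf"): ("bot",),
    ("none", "node"): ("down", 0, "none"),
}
checker = LazyEmptiness(alphabet, trans)
```

Here `checker.empty(frozenset({"none"}))` succeeds through the coinductive assumption in `P`, whereas `frozenset({"any"})` is refuted by the `leaf` symbol and recorded in `N`.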



Witness It is not difficult to see that the algorithm can be further instrumented in order to produce a witness for non-emptiness (that is, when empty($\bar X$) returns false, it also returns a tree $v$ which belongs to $[\![\bar X]\!]$). To do so, we keep for each state in N a witness, and we also attach a witness to each component of the accumulator $(\bar X_1,\dots,\bar X_n)$ in the enumeration for the disjunctive normal form. When checking for the emptiness of $\bar X_h \cup \{X\}$, we know that $\bar X_h$ is a non-empty state, and we have at our disposal a witness $v$ for this state. Before doing the recursive call to empty, we can first check whether this witness $v$ is in $[\![X]\!]$ (this can be done very efficiently). If this is the case, we know that $\bar X_h \cup \{X\}$ is also non-empty. In practice, this optimization avoids many calls to empty.

Negation and reflexivity We have mentioned in Section 4.1.4 an optimization which introduces alternating formulas with negation. Using De Morgan's laws, we can push the negation down and thus assume that it can only appear immediately above an atomic formula $\downarrow_j X$. Of course, it is possible to get rid of the negation by introducing for each state $X$ a dual state $\neg X$ whose transition formula (for each tag) is the negation of the one for $X$; this only doubles the number of states. However, we prefer to support negated atomic formulas $\neg\downarrow_j X$ directly in the algorithm, because we can use the very simple fact that such a formula does not intersect $\downarrow_j X$. The algorithm is thus modified to work with pairs of sets of $\mathcal{A}$-states, written $(\bar X, \bar Y)$, which intuitively represents the set $\bigcap_{X \in \bar X} [\![X]\!]_{\mathcal{A}} \setminus \bigcup_{Y \in \bar Y} [\![Y]\!]_{\mathcal{A}}$. We define $\varphi((\bar X, \bar Y), a^{(n)})$ as $\bigwedge_{X \in \bar X} \Phi(X, a^{(n)}) \wedge \bigwedge_{Y \in \bar Y} \neg\Phi(Y, a^{(n)})$. The fact mentioned above translates itself into a shortcut case in the empty function: if the input is $(\bar X, \bar Y)$ with $\bar X \cap \bar Y \neq \emptyset$, then the result is true (meaning that $(\bar X, \bar Y)$ trivially denotes an empty set of trees).
The interesting cases for enumeration of the normal form are:

  | ↓h X :: rest ->
      if empty((Xh ∪ {X}, Yh)) then return true
      return (empty_dnf(rest, ((X1,Y1), ..., (Xh ∪ {X}, Yh), ..., (Xn,Yn))))
  | ¬↓h Y :: rest ->
      if empty((Xh, Yh ∪ {Y})) then return true
      return (empty_dnf(rest, ((X1,Y1), ..., (Xh, Yh ∪ {Y}), ..., (Xn,Yn))))



Preprocessing Note the following trivial facts: for a formula $\varphi_1 \wedge \varphi_2$ to be empty, it is sufficient to have $\varphi_1$ or $\varphi_2$ empty; for a formula $\varphi_1 \vee \varphi_2$ to be empty, it is sufficient to have $\varphi_1$ and $\varphi_2$ empty; for a formula $\downarrow_j X$ to be empty, it is sufficient to have all the formulas $\Phi(X, a^{(n)})$ empty; for a formula $\neg\downarrow_j X$ to be empty, it is sufficient to have all the formulas $\neg\Phi(X, a^{(n)})$ empty.

Using these sufficient conditions and a largest fixpoint computation, we get a sound and efficient approximation of emptiness for formulas (it returns true only if the formula is indeed empty, but it may also return false in this case). We use this approximate criterion to replace any subformula which is trivially empty with $\bot$ and any subformula $\varphi$ such that $\neg\varphi$ is trivially empty with $\top$ (and then apply Boolean tautologies to eliminate $\bot$ and $\top$ as arguments of $\vee$ or $\wedge$). In practice, this optimization is very effective in reducing the size and complexity of formulas involved in the real (exact) emptiness check.
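One way to realize this approximation is sketched below in Python. It is a hedged illustration: the function name `trivially_empty`, the tuple encoding of formulas (with `("ndown", i, X)` standing for a negated atom) and the three-state example are assumptions of the sketch. The computation starts from the optimistic assumption that every state is empty and that every negated state is empty, and then repeatedly discards assumptions that the sufficient conditions do not support.

```python
def trivially_empty(states, alphabet, trans):
    """Greatest fixpoint of the sufficient conditions for emptiness.

    Returns (E, NE): E[x] is True when state x is found trivially empty,
    NE[x] is True when the negation of x is found trivially empty.
    """
    E = {x: True for x in states}
    NE = {x: True for x in states}

    def empty(phi, neg=False):
        """Is phi (or its negation, when neg is set) trivially empty?"""
        kind = phi[0]
        if kind == "top":
            return neg
        if kind == "bot":
            return not neg
        if kind == "down":
            return NE[phi[2]] if neg else E[phi[2]]
        if kind == "ndown":
            return E[phi[2]] if neg else NE[phi[2]]
        left, right = empty(phi[1], neg), empty(phi[2], neg)
        # A conjunction needs one empty side; under negation it acts as a disjunction.
        if (kind == "and") != neg:
            return left or right
        return left and right

    changed = True
    while changed:
        changed = False
        for x in states:
            e = all(empty(trans[(x, a)]) for a in alphabet)
            ne = all(empty(trans[(x, a)], neg=True) for a in alphabet)
            if (e, ne) != (E[x], NE[x]):
                E[x], NE[x] = e, ne
                changed = True
    return E, NE


# Hypothetical ata over {leaf, node}: "any" is universal, "none" is empty,
# "leafonly" accepts exactly the tree `leaf`.
alphabet = ["leaf", "node"]
states = ["any", "none", "leafonly"]
trans = {
    ("any", "leaf"): ("top",),
    ("any", "node"): ("and", ("down", 0, "any"), ("down", 1, "any")),
    ("none", "leaf"): ("bot",),
    ("none", "node"): ("down", 0, "none"),
    ("leafonly", "leaf"): ("top",),
    ("leafonly", "node"): ("bot",),
}
E, NE = trivially_empty(states, alphabet, trans)
```

A subformula mentioning `none` can then be replaced by the false constant, and one mentioning `any` by the true constant, before the exact check runs.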



5 Experiments 

We have experimented with our typechecker on various XML transformations implemented as mtts. Although we did not try very big transformations, we did work with large input and output tree automata automatically generated from the XHTML DTD (without taking XML attributes into account). Note that because this DTD has many tags, the mtts actually have many transitions since they typically copy tags, which requires all constructors corresponding to these tags to be enumerated. They do not have too many procedures, though. The bottom-up deterministic automaton that we generated from the XHTML DTD has 35 states.

Table 1 gives the elapsed times spent in typechecking several transformations and the number of states of the inferred alternating tree automaton that have been materialized. The experiment was conducted on an Intel Pentium 4 processor at 2.80 GHz, running Linux kernel 2.4.27, and the typechecking time includes the whole process (determinization of the output type, backward inference, intersection with the input type, emptiness check). The typechecker is implemented in and compiled by Objective Caml 3.09.3.

We also indicate the number of procedures in each mtt, the maximum number of parameters, and the minimum integer $b$, if any, such that the mtt is syntactically $b$-bounded copying. Intuitively, the integer $b$ captures the maximum number of times the mtt traverses any node of the input tree. This notion has been introduced in [12] where the existence of $b$ is shown to imply polynomial-time typechecking. We observe that even unbounded-copying mtts can be typechecked efficiently.



Transformation:             (1)    (2)    (3)    (4)    (5)    (6)    (7)
# of procedures:              2      2      3      5      4      6      6
Max # of parameters:          1      1      1      1      2      2      2
Bounded copying:              1      1      2      ∞      ∞      2      1
Type-checking time (ms):   1057   1042   0373   0377   0337   0409   0410
# of states in the ata:     147    147     43     74     37     49     49

Table 1: Results of the experiments



Unless otherwise stated, transformations are checked to have type XHTML → XHTML (i.e., both input and output types are XHTML). Transformation (1) removes all the <b> tags, keeping their contents. Transformation (2) is a variant that drops the <div> tags instead. The typechecker detects that the latter doesn't have type XHTML → XHTML by producing a counter-example:

<html><head><title/></head><body><div/></body></html>

Indeed, removing the <div> element may produce a <body> element with an empty content, which is not valid in XHTML. Transformation (3) copies all the <a> elements (and their corresponding subtrees) into a new <div> element and prepends the <div> to the <body> element. Transformation (4) groups together adjacent <b> elements, concatenating their contents. Transformation (5) extracts from an XHTML document a tree of depth 2 which represents the conceptual nesting structure of <h1> and <h2> heading elements (note that, in XHTML, the structure among headings is flat). Transformation (6) builds a tree representing a table of contents for the top two levels of itemizations, giving section and subsection numbers to them (where the numbers are constructed as Peano numerals), and prepends the resulting tree to the <body> element. Transformation (7) is a variant that only returns the table of contents.

We have also translated some transformations (that can be expressed as mtts) used by Tozawa and Hagiya in [26] (namely htmlcopy, inventory, pref2app, pref2html, prefcopy). Our implementation takes between 2 ms and 6 ms to typecheck these mtts, except for inventory, for which it takes 22 ms. Tozawa and Hagiya report performance between 5 ms and 1000 ms on a Pentium M 1.8 GHz for the satisfiability check (which corresponds to our emptiness check and excludes the time taken by backward inference). These results suggest some advantage for our approach, but the numbers are small and Tozawa and Hagiya did not run experiments as large as ours, so it is hard to draw a meaningful conclusion.

6 Conclusion and Future Work 

We have presented an efficient typechecking algorithm for mtts based on the idea of using alternating tree automata for representing the preimage of the given mtt obtained from the backward type inference. This representation was useful for deriving optimization techniques on the backward inference phase such as state partitioning and Cartesian factorization, and was also effective for speeding up the subsequent emptiness check phase by exploiting Boolean equivalences among formulas. Our experimental results confirmed that our techniques allow us to typecheck small transformations with respect to the full XHTML type. Finally, we have also made an exact connection to two known algorithms, a classical one and Maneth-Perst-Seidl's, the latter implying an important polynomial complexity under a bounded-copying restriction.

The present work is only the first step toward a truly practical typechecker for mtts. In the future, we will seek further improvements that allow typechecking larger and more complicated transformations. In particular, transformations with upward axes can be obtained by compositions of mtts as proved in [11], and a capability to typecheck such compositions of mtts in a reasonable time will be important. We have some preliminary ideas for the improvement and plan to pursue them as a next step. In the end, we hope to be able to handle (at least a reasonably large subset of) XSLT.

Alain Frisch, Haruo Hosoya

A Comparison

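In this section, we compare our algorithm with two existing algorithms, the classical one based on function enumeration and the Maneth-Perst-Seidl algorithm.

A.1 Classical Algorithm

The classical algorithm presented here is folklore. Variants can be found in the literature for deterministic mtts [4] and for macro forest transducers [20]. The algorithm takes a dbta $\mathcal{M} = (Q, Q_F, \Delta)$ and an mtt $\mathcal{T} = (P, P_0, \Pi)$ and builds a dbta $\mathcal{M}' = (D, D_F, \delta)$ where:

\[
\begin{aligned}
D &= \{\langle p^{(m)}, \vec q\rangle \mid p^{(m)} \in P,\ \vec q \in Q^m\} \to 2^Q \\
D_F &= \{d \in D \mid p_0 \in P_0,\ d(\langle p_0\rangle) \cap Q_F \neq \emptyset\} \\
\delta &= \{d \leftarrow a^{(n)}(\vec d) \mid d(\langle p^{(m)}, \vec q\rangle) = \bigcup\nolimits_{(p^{(m)}(a^{(n)}(\vec x), \vec y) \to e) \in \Pi} \mathrm{DInf}(e, \vec d, \vec q)\}
\end{aligned}
\]

Here, the function DInf is defined as follows.

\[
\begin{aligned}
\mathrm{DInf}(b^{(m)}(e_1,\dots,e_m), \vec d, \vec q) &= \{q' \mid q' \leftarrow b^{(m)}(\vec q') \in \Delta,\ q'_j \in \mathrm{DInf}(e_j, \vec d, \vec q)\ \forall j = 1,\dots,m\} \\
\mathrm{DInf}(p(x_h, e_1,\dots,e_l), \vec d, \vec q) &= \bigcup \{d_h(\langle p, \vec q'\rangle) \mid q'_i \in \mathrm{DInf}(e_i, \vec d, \vec q),\ i = 1,\dots,l\} \\
\mathrm{DInf}(y_j, \vec d, \vec q) &= \{q_j\}
\end{aligned}
\]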

The constructed automaton $\mathcal{M}'$ has, as states, the set of all functions that map each pair of a procedure and parameter types to a set of states. Intuitively, each state $d$ represents the set of trees $v$ such that, given a procedure $p^{(m)}$ and states $\vec q$, the set of results of evaluating $p$ with the tree $v$ and parameters $\vec w$ of types $\vec q$ is exactly described by the states $d(\langle p, \vec q\rangle)$. Thus, the final states $D_F$ represent the set of trees $v$ such that the set of results from evaluating an initial procedure $p_0$ with $v$ contains a tree accepted by the given dbta $\mathcal{M}$.

The function DInf computes, from a given expression $e$, states $\vec d$ from $D$, and states $\vec q$ from $Q$, the set of states that exactly describes the set of results of evaluating $e$ with a tuple $\vec v$ of trees of types $\vec d$ and parameters of types $\vec q$. Then we can collect in $\delta$ transitions $d \leftarrow a^{(n)}(\vec d)$ for all $a^{(n)}$ and all $\vec d$ such that $d$ is computed for all $p^{(m)}$ and all $\vec q$ by using DInf with the expression of each rule of $p^{(m)}$ for the symbol $a^{(n)}$. By this intuition, each of the three cases for DInf can be understood as follows.

• The set of results of evaluating the constructor expression $b^{(m)}(e_1,\dots,e_m)$ is described by the set of states $q'$ that have a transition $q' \leftarrow b^{(m)}(\vec q') \in \Delta$ such that each $q'_i$ describes the results of evaluating the corresponding subexpression $e_i$.

• The set of results of evaluating the procedure call $p(x_h, e_1,\dots,e_l)$ is the set of results of evaluating $p$ with the $h$-th input tree $v_h$ and parameters resulting from evaluating each $e_i$. This set can be obtained by collecting the results of applying the function $d_h$ to $p$ and $\vec q'$ where each $q'_i$ is one of the states that describe the set of results of $e_i$.

• The set of results of evaluating the variable expression $y_j$ is exactly described by its type $q_j$.

Thus, the intuition behind this algorithm is rather different from our approach. Nevertheless, we can prove that the resulting automaton from the classical algorithm is isomorphic to the one obtained from our approach followed by determinization.

Determinization translates an ata $\mathcal{A} = (\Xi, \Xi_0, \Phi)$ into a dbta $\mathcal{N} = (R, R_F, \Gamma)$ where

\[
\begin{aligned}
R &= 2^\Xi \\
R_F &= \{r \in R \mid r \cap \Xi_0 \neq \emptyset\} \\
\Gamma &= \{r \leftarrow a^{(n)}(\vec r) \mid r = \{X \mid \vec r \vdash \Phi(X, a^{(n)})\}\}.
\end{aligned}
\]

Here, the judgment $\vec r \vdash \varphi$ is defined inductively as follows.

• $\vec r \vdash \varphi_1 \wedge \varphi_2$ if $\vec r \vdash \varphi_1$ and $\vec r \vdash \varphi_2$.

• $\vec r \vdash \varphi_1 \vee \varphi_2$ if $\vec r \vdash \varphi_1$ or $\vec r \vdash \varphi_2$.

• $\vec r \vdash \top$.

• $\vec r \vdash \downarrow_i X$ if $X \in r_i$.

That is, $\vec r \vdash \varphi$ intuitively means that $\varphi$ holds by interpreting each $\downarrow_i X$ as "$X$ is a member of the set $r_i$".
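The judgment and the resulting transition function can be sketched in a few lines of Python. This is an illustration under assumptions: `holds` and `det_target` are hypothetical names, formulas use a tuple encoding with 0-based child indices, and the example ata is invented for the demonstration.

```python
def holds(r, phi):
    """Decide the judgment r |- phi, where r is a tuple of sets of ata states."""
    kind = phi[0]
    if kind == "top":
        return True
    if kind == "bot":
        return False
    if kind == "down":
        _, i, state = phi
        return state in r[i]
    if kind == "and":
        return holds(r, phi[1]) and holds(r, phi[2])
    return holds(r, phi[1]) or holds(r, phi[2])  # kind == "or"


def det_target(r, states, trans, symbol):
    """Unique target state of the determinized automaton for symbol(r)."""
    return frozenset(x for x in states if holds(r, trans[(x, symbol)]))


# Hypothetical ata: "any" accepts every tree, "none" accepts no tree.
states = ["any", "none"]
trans = {
    ("any", "leaf"): ("top",),
    ("any", "node"): ("and", ("down", 0, "any"), ("down", 1, "any")),
    ("none", "leaf"): ("bot",),
    ("none", "node"): ("down", 0, "none"),
}
leaf_state = det_target((), states, trans, "leaf")
node_state = det_target((leaf_state, leaf_state), states, trans, "node")
```

Both computed states equal `{"any"}`: every tree has type `any` and no tree has type `none`, so only one state of the determinized automaton is reachable in this example.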

The intuition behind determinization of an ata is the same as that of a nondeterministic tree automaton. That is, each state $r$ in $\mathcal{N}$ denotes the set of trees $v$ that have type $X$ for all members $X$ of $r$ and do not have type $Y$ for all non-members $Y$ of $r$.

\[ [\![r]\!] = \bigcap_{X \in r} [\![X]\!] \setminus \bigcup_{Y \notin r} [\![Y]\!] \qquad (6) \]

This implies that no tree can have type $r$ and $r'$ at the same time when $r \neq r'$. Thus, the states of the tree automaton $\mathcal{N}$ form a partition of all the trees, that is, $\mathcal{N}$ is complete and deterministic. From this, we can understand the equivalence between $\mathcal{A}$ and $\mathcal{N}$ since each final state in $\mathcal{N}$ contains an initial state in the original ata $\mathcal{A}$ and therefore the set of such final states forms a partition of the sets denoted by the initial states of $\mathcal{A}$. Then, by using the formula (6), the interpretation "$X$ is contained in $r_i$" of $\downarrow_i X$ in the judgment $\vec r \vdash \varphi$ implies that $[\![r_i]\!] \subseteq [\![X]\!]$. Here, we can see a parallelism between the intuition of the judgment $\vec v \vdash \varphi$ (where $\downarrow_i X$ is interpreted "$v_i \in [\![X]\!]$") and that of $\vec r \vdash \varphi$. Indeed, a key property to the proof below is: $\vec v \vdash \varphi$ if and only if $\vec r \vdash \varphi$ for some $\vec r$ such that $\vec v \in [\![\vec r]\!]$.

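Proposition 2 $\mathcal{A}$ and $\mathcal{N}$ are equivalent.

Proof: To prove the result, it suffices to show the following.

\[ v \in [\![r]\!] \iff r = \{X \mid v \in [\![X]\!]\}. \qquad (7) \]

(Note that this is a rewriting of the equation (6).) Indeed, this implies

\[
\begin{aligned}
v \in \mathcal{L}(\mathcal{N}) &\iff v \in [\![R_F]\!] \\
&\iff \exists r.\ (r \cap \Xi_0 \neq \emptyset \wedge r = \{X \mid v \in [\![X]\!]\}) \\
&\iff \exists X \in \Xi_0.\ v \in [\![X]\!] \\
&\iff v \in \mathcal{L}(\mathcal{A}).
\end{aligned}
\]

The proof proceeds by induction on the structure of $v$. To show (7), the following is sufficient

\[ (\exists \vec r.\ \vec v \in [\![\vec r]\!] \wedge \vec r \vdash \varphi) \iff \vec v \vdash \varphi. \qquad (8) \]

since this implies

\[
\begin{aligned}
a^{(n)}(\vec v) \in [\![r]\!] &\iff \exists (r \leftarrow a^{(n)}(\vec r)) \in \Gamma.\ \vec v \in [\![\vec r]\!] \\
&\iff \exists \vec r.\ r = \{X \mid \vec r \vdash \Phi(X, a^{(n)})\} \wedge \vec v \in [\![\vec r]\!] \\
&\iff r = \{X \mid \vec v \vdash \Phi(X, a^{(n)})\} \\
&\iff r = \{X \mid a^{(n)}(\vec v) \in [\![X]\!]\}.
\end{aligned}
\]

The proof of (8) itself is done by induction on the structure of $\varphi$. The "only if" direction is straightforward. For the "if" direction, let $r_i = \{X \mid v_i \in [\![X]\!]\}$ for $i = 1,\dots,n$. By the induction hypothesis, (7) gives $v_i \in [\![r_i]\!]$. The rest is case analysis on $\varphi$.

• Case $\varphi = \bot$. This never arises.

• Case $\varphi = \top$. This case trivially holds.

• Case $\varphi = \downarrow_h X$. From $\vec v \vdash \varphi$, we have $v_h \in [\![X]\!]$ and therefore $X \in r_h$ by the definition of $r_h$. This implies the result.

• Case $\varphi = \varphi_1 \wedge \varphi_2$. By the induction hypothesis, $\vec v \in [\![\vec r']\!]$ and $\vec r' \vdash \varphi_1$ with $\vec v \in [\![\vec r'']\!]$ and $\vec r'' \vdash \varphi_2$ for some $\vec r'$ and $\vec r''$. Since $\mathcal{N}$ is deterministic, both $\vec r'$ and $\vec r''$ are actually equal to $\vec r$. Hence the result follows.

• Case $\varphi = \varphi_1 \vee \varphi_2$. Similar to the previous case. □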

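Proposition 3 Let $\mathcal{N}$ be obtained by determinizing the ata from the last section. Then, $\mathcal{N}$ and $\mathcal{M}'$ are isomorphic.

Proof: Define the function $\beta$ from $D$ to $R$ as follows:

\[ \beta(d) = \{\langle p^{(m)}, q, \vec q\rangle \mid p^{(m)} \in P,\ \vec q \in Q^m,\ q \in d(\langle p, \vec q\rangle)\} \]

Clearly, $\beta$ is bijective: $\beta^{-1}(r)(\langle p, \vec q\rangle) = \{q \mid \langle p^{(m)}, q, \vec q\rangle \in r\}$. It remains to show that $\beta$ is an isomorphism between $\mathcal{N}$ and $\mathcal{M}'$, that is, (1) $\beta(D_F) = R_F$ and (2) $\beta(\delta(\vec d)) = \Gamma(\beta(\vec d))$ for each $\vec d$.

The condition (1) clearly holds since $d(p_0) \cap Q_F \neq \emptyset$ iff $\langle p_0, q\rangle \in \beta(d)$ for some $q \in Q_F$. To prove (2), it suffices to show

\[ q \in \mathrm{DInf}(e, \vec d, \vec q) \text{ iff } \beta(\vec d) \vdash \mathrm{Inf}(e, q, \vec q). \]

Here, $\beta(d_1,\dots,d_k)$ stands for $(\beta(d_1),\dots,\beta(d_k))$. The proof is by induction on the structure of $e$.

• Case $e = b^{(m)}(e_1,\dots,e_m)$.

\[
\begin{aligned}
q \in \mathrm{DInf}(e, \vec d, \vec q) &\iff \exists (q \leftarrow b^{(m)}(\vec q')) \in \Delta.\ \forall j.\ q'_j \in \mathrm{DInf}(e_j, \vec d, \vec q) \\
&\iff \exists (q \leftarrow b^{(m)}(\vec q')) \in \Delta.\ \forall j.\ \beta(\vec d) \vdash \mathrm{Inf}(e_j, q'_j, \vec q) \\
&\iff \beta(\vec d) \vdash \bigvee_{(q \leftarrow b^{(m)}(\vec q')) \in \Delta} \; \bigwedge_{j=1,\dots,m} \mathrm{Inf}(e_j, q'_j, \vec q) \\
&\iff \beta(\vec d) \vdash \mathrm{Inf}(e, q, \vec q)
\end{aligned}
\]

• Case $e = p(x_h, e_1,\dots,e_l)$.

\[
\begin{aligned}
q \in \mathrm{DInf}(e, \vec d, \vec q) &\iff q \in \bigcup \{d_h(p, \vec q') \mid q'_i \in \mathrm{DInf}(e_i, \vec d, \vec q),\ i = 1,\dots,l\} \\
&\iff \exists \vec q'.\ q \in d_h(p, \vec q') \text{ and } \forall i.\ q'_i \in \mathrm{DInf}(e_i, \vec d, \vec q) \\
&\iff \exists \vec q'.\ \langle p, q, \vec q'\rangle \in \beta(d_h) \text{ and } \forall i.\ \beta(\vec d) \vdash \mathrm{Inf}(e_i, q'_i, \vec q) \\
&\iff \beta(\vec d) \vdash \bigvee_{\vec q'} \; \bigwedge_{i} \mathrm{Inf}(e_i, q'_i, \vec q) \wedge \downarrow_h \langle p, q, \vec q'\rangle \\
&\iff \beta(\vec d) \vdash \mathrm{Inf}(e, q, \vec q)
\end{aligned}
\]

• Case $e = y_j$. First, $q \in \mathrm{DInf}(y_j, \vec d, \vec q)$ iff $q = q_j$. If $q = q_j$, then $\mathrm{Inf}(e, q, \vec q) = \top$ and therefore the RHS holds. If $q \neq q_j$, then $\mathrm{Inf}(e, q, \vec q) = \bot$ and therefore the RHS does not hold. □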

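A.2 Maneth-Perst-Seidl Algorithm

First, for simplicity in comparing the two algorithms, following [12], we consider an mtt where the input type is already encoded into procedures. That is, instead of the original mtt $\mathcal{T}$, we take an mtt $\mathcal{T}'$ and a bta $\mathcal{M}_{in}$ such that

\[ \mathcal{T}'(v) = \begin{cases} \mathcal{T}(v) & (v \in \mathcal{L}(\mathcal{M}_{in})) \\ \emptyset & (\text{otherwise}). \end{cases} \]

That is, $\mathcal{T}'$ behaves exactly the same as $\mathcal{T}$ for the inputs from $\mathcal{L}(\mathcal{M}_{in})$ but returns no result for the other inputs. See [12] for a concrete construction. Having done this, we only need to check that $\{v \mid \mathcal{T}'(v) \cap \mathcal{L}(\mathcal{M}) \neq \emptyset\} = \emptyset$.

In the Maneth-Perst-Seidl algorithm, we construct a new mtt $\mathcal{U}$ from $\mathcal{T}' = (P, P_0, \Pi)$ specialized to the output-type dbta $\mathcal{M} = (Q, Q_F, \Delta)$ such that $\mathcal{U}(v) = \mathcal{T}'(v) \cap \mathcal{L}(\mathcal{M})$ for any tree $v$. This can be done by constructing the mtt $\mathcal{U} = (S, S_0, \Omega)$ where

\[
\begin{aligned}
S &= \{\langle p^{(m)}, q, \vec q\rangle^{(m)} \mid p^{(m)} \in P,\ q \in Q,\ \vec q \in Q^m\} \\
S_0 &= \{\langle p_0, q\rangle \mid p_0 \in P_0,\ q \in Q_F\} \\
\Omega &= \{\langle p^{(m)}, q, \vec q\rangle(a^{(n)}(\vec x), \vec y) \to e' \mid (p^{(m)}(a^{(n)}(\vec x), \vec y) \to e) \in \Pi,\ e' \in \mathrm{Spec}(e, q, \vec q)\}.
\end{aligned}
\]

Here, we define the function Spec as follows.

\[
\begin{aligned}
\mathrm{Spec}(a(e_1,\dots,e_n), q, \vec q) &= \{a(e'_1,\dots,e'_n) \mid q \leftarrow a(q'_1,\dots,q'_n) \in \Delta,\ \forall i.\ e'_i \in \mathrm{Spec}(e_i, q'_i, \vec q)\} \\
\mathrm{Spec}(p(x_h, e_1,\dots,e_l), q, \vec q) &= \{\langle p, q, \vec q'\rangle(x_h, e'_1,\dots,e'_l) \mid \vec q' \in Q^l,\ \forall i.\ e'_i \in \mathrm{Spec}(e_i, q'_i, \vec q)\} \\
\mathrm{Spec}(y_i, q, \vec q) &= \{y_i\}
\end{aligned}
\]

Intuitively, each procedure $\langle p, q, \vec q\rangle$ in the new mtt $\mathcal{U}$ yields, for any input value $v$ and for any parameters $\vec w$ of types $\vec q$, the same results as $p$ but restricted to type $q$:

\[ [\![\langle p^{(m)}, q, \vec q\rangle]\!](v, \vec w) = [\![p^{(m)}]\!](v, \vec w) \cap [\![q]\!] \]

Similarly, $\mathrm{Spec}(e, q, \vec q)$ yields, for any input values $\vec v$ and for all parameters $\vec w$ of types $\vec q$, the same results as $e$ but restricted to type $q$:

\[ [\![\mathrm{Spec}(e, q, \vec q)]\!](\vec v, \vec w) = [\![e]\!](\vec v, \vec w) \cap [\![q]\!] \]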



After thus constructing the mtt $\mathcal{U}$, it remains to check that the translation of $\mathcal{U}$ is empty, i.e., $\mathcal{U}(v) = \emptyset$ for any value $v$. This can be done as follows. First define the following system of implications $\rho'$, where we introduce propositional variables $\bar{X}$ ranging over all subsets of $S$:

$$\rho' = \{\bar{X} \Leftarrow \bar{X}_1 \wedge \dots \wedge \bar{X}_n \mid \exists a^{(n)}.\ \exists e_1, \dots, e_k.\ \forall s^{(m)} \in \bar{X}.\ \exists j.\ (s^{(m)}(a^{(n)}(\bar{x}), \bar{y}) \to e_j) \in \Omega,\ \forall i = 1, \dots, n.\ \bar{X}_i = \{s' \in S \mid \exists j = 1, \dots, k.\ s'(x_i, \dots) \text{ occurs in } e_j\}\}$$

and then check whether $\rho' \vdash \{s\}$ for some $s \in S_0$. Intuitively, each propositional variable $\bar{X}$ denotes whether there is some input $v$ from which any procedure in the set $\bar{X}$ translates to some value with some parameters:

$$\exists v.\ \forall s^{(m)} \in \bar{X}.\ \exists \bar{w}.\ [\![s^{(m)}]\!](v, \bar{w}) \neq \emptyset$$
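Deciding which variables are derivable from such a system of implications is a least-fixpoint computation over Horn-style rules. The following Python fragment is a minimal sketch of that step, assuming the rules have already been enumerated as pairs `(head, body)` where `head` is a frozenset of procedures and `body` is the list of frozensets on the right-hand side; the name `provable` and this rule encoding are assumptions of the sketch, and the naive iteration is not the optimized algorithm of the paper.

```python
def provable(rules):
    """Return the set of variables derivable from Horn-style rules.

    rules: iterable of (head, body) pairs, read as head <= body[0] and ...
    and body[-1]. A rule with an empty body (nullary symbol) is a fact.
    """
    rules = list(rules)
    derived = set()
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # Fire a rule once all of its premises are already derived.
            if head not in derived and all(x in derived for x in body):
                derived.add(head)
                changed = True
    return derived


def some_initial_provable(rules, initial):
    """True if the singleton {s} is derivable for some initial procedure s."""
    derived = provable(rules)
    return any(frozenset([s]) in derived for s in initial)
```

With this encoding, `some_initial_provable(rules, S0)` returning `True` corresponds to $\rho' \vdash \{s\}$ for some $s \in S_0$, that is, to the existence of an input on which $\mathcal{U}$ produces a result.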

Now, we can prove that the system of implications obtained from the MPS algorithm and the one from our algorithm are exactly the same. From this, we can directly carry over useful properties found for the MPS algorithm to our algorithm. In particular, our algorithm has the same polynomial time complexity under the restriction of a finitely bounded number of copying [12].

Proposition 4 Given an input type that accepts all trees and the mtt $\mathcal{T}$ defined above, let $\mathcal{A}$ and $\rho$ be the ata and the system of implications obtained by the algorithm in Section^ Let $H_0$ be $\mathcal{A}$'s initial states. Then, $(\rho, H_0)$ and $(\rho', S_0)$ are identical.

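Proof: Note that both $\rho$ and $\rho'$ consist of all variables $\bar{X}$ where $\bar{X}$ is a set of triples from $P \times Q \times Q^m$. The result follows by showing that $(\bar{X} \Leftarrow \bar{X}_1 \wedge \dots \wedge \bar{X}_n) \in \rho$ iff $(\bar{X} \Leftarrow \bar{X}_1 \wedge \dots \wedge \bar{X}_n) \in \rho'$. It suffices to show, for any $\bar{X}$ and $i$, that

$$\exists e_1, \dots, e_k.\ \forall s \in \bar{X}.\ \exists j.\ (s(a(\bar{x}), \bar{y}) \to e_j) \in \Omega,\quad \bar{X}_i = \{s' \in S \mid \exists j = 1, \dots, k.\ s'(x_i, \dots) \text{ occurs in } e_j\}$$

iff

$$(\bar{X}_1, \dots, \bar{X}_n) \in \mathrm{DNF}\Big(\bigwedge_{s \in \bar{X}} \Phi(s, a)\Big).$$

This follows by showing that, for all $(\bar{X}_1, \dots, \bar{X}_n) \in \mathrm{DNF}(\mathrm{Inf}(e_1, q_1, \bar{q}_1) \wedge \dots \wedge \mathrm{Inf}(e_k, q_k, \bar{q}_k))$,

$$\exists j = 1, \dots, k.\ s'(x_i) \text{ occurs in } \mathrm{Spec}(e_j, q_j, \bar{q}_j) \iff s' \in \bar{X}_i.$$

This can be proved by induction on $|e_1| + \dots + |e_k|$ where $|e|$ is the size of $e$. □

Corollary 1 For any $b$-bounded copying mtt, our algorithm runs in polynomial time.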

B Alternating tree automata with bounded traversing 

The corollary in the last section depends on the proof of polynomiality from [12]. It tells us that the emptiness check for an alternating automaton has polynomial time complexity when the automaton is obtained by applying the basic backward inference algorithm from Section 3 to a $b$-bounded copying mtt. It seems natural to look for a counterpart of the notion of $b$-bounded copying for alternating automata that directly ensures the polynomiality of the emptiness check.

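Let $\mathcal{A} = (H, H_0, \Phi)$ be an ata. For each state $X \in H$, we define the maximal traversal number $b[X]$ as the least fixpoint of a constraint system over $\mathbb{N} = \{1 < 2 < \dots < \infty\}$, the complete lattice of positive integers extended with $\infty$, with constraints of the form:

$$b[X] \geq b_i[\Phi(X, a^{(n)})]$$

for $a^{(n)} \in \Sigma$ and $1 \leq i \leq n$, where $b_i[\phi]$ is defined inductively:

$$\begin{aligned}
b_i[\top] &= 0 \\
b_i[\bot] &= 0 \\
b_i[\phi_1 \wedge \phi_2] &= b_i[\phi_1] + b_i[\phi_2] \\
b_i[\phi_1 \vee \phi_2] &= \max(b_i[\phi_1], b_i[\phi_2]) \\
b_i[\downarrow_h X] &= \begin{cases} b[X] & \text{if } i = h \\ 0 & \text{if } i \neq h \end{cases}
\end{aligned}$$

The ata $\mathcal{A}$ is (syntactically) $b$-bounded traversing if $b[X] \leq b$ for all $X \in H_0$.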
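To make the constraint system concrete, here is a small Python sketch that computes the traversal numbers by naive iteration from the bottom element 1. The formula encoding (`("top",)`, `("bot",)`, `("and", f, g)`, `("or", f, g)`, `("down", h, X)`), the dictionaries `arity` and `Phi`, and the function names are assumptions of this sketch. It also assumes $b_i[\top] = b_i[\bot] = 0$ and $b_i[\downarrow_h X] = b[X]$ when $i = h$ and 0 otherwise. The cut-off that turns a value into infinity once it still grows after $|H|$ rounds is a simple device chosen for the sketch, not something taken from the paper.

```python
import math

def b_i(phi, i, b):
    """Traversal count of formula phi along child direction i."""
    tag = phi[0]
    if tag in ("top", "bot"):
        return 0
    if tag == "and":
        return b_i(phi[1], i, b) + b_i(phi[2], i, b)
    if tag == "or":
        return max(b_i(phi[1], i, b), b_i(phi[2], i, b))
    if tag == "down":
        _, h, x = phi
        return b[x] if h == i else 0
    raise ValueError("unknown formula tag: %r" % (tag,))

def traversal_numbers(states, arity, Phi):
    """Least solution of b[X] >= b_i[Phi(X, a)] over {1 < 2 < ... < inf}.

    states: list of states; arity: dict symbol -> n;
    Phi: dict (state, symbol) -> formula.
    """
    b = {x: 1 for x in states}  # bottom of the lattice
    rounds = 0
    while True:
        rounds += 1
        changed = False
        for (x, a), phi in Phi.items():
            for i in range(1, arity[a] + 1):
                v = b_i(phi, i, b)
                if v > b[x]:
                    # Still growing after |H| rounds: treat as divergent.
                    b[x] = math.inf if rounds > len(states) else v
                    changed = True
        if not changed:
            return b

def is_bounded_traversing(b, initial, bound):
    """Check b[X] <= bound for every initial state X."""
    return all(b[x] <= bound for x in initial)
```

For instance, a state whose transition formula is $\downarrow_1 Y \wedge \downarrow_1 Y$ gets traversal number 2 when $b[Y] = 1$, whereas a state $W$ with formula $\downarrow_1 W \wedge \downarrow_1 W$ has no finite solution and is assigned infinity.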

We mention without proving it formally that when we apply our backward inference algorithm to a $b$-bounded copying mtt, then the resulting ata is $b$-bounded traversing. More precisely, we can show that $b[\langle p^{(k)}, q, \bar{q}\rangle] \leq b[p^{(k)}]$ where $b[p^{(k)}]$ denotes the maximal copy number for the procedure $p^{(k)}$, as defined in [12]. As a matter of fact, the optimizations given in Section 4.1 preserve this property (but the ata formally has exponentially many more states, even if in practice only a fraction of them is going to be materialized).

Now it remains to establish that the emptiness check for a $b$-bounded traversing ata runs in polynomial time. For a set of states $\bar{X}$, we define $b[\bar{X}]$ as $\sum_{X \in \bar{X}} b[X]$. For any formula $\phi$, any $(\bar{X}_1, \dots, \bar{X}_n) \in \mathrm{DNF}(\phi)$ and any $1 \leq i \leq n$, we observe that $b[\bar{X}_i] \leq b_i[\phi]$. The proof is by induction on the structure of $\phi$. As a consequence, for any $(\bar{X}_1, \dots, \bar{X}_n) \in \mathrm{DNF}(\bigwedge_{X \in \bar{X}} \Phi(X, a))$, we have $b[\bar{X}_i] \leq b[\bar{X}]$. So, if the ata is $b$-bounded traversing, then the emptiness check algorithm will only consider sets of states $\bar{X}$ such that $b[\bar{X}] \leq b$. Since $b[\bar{X}]$ is an upper bound for the cardinality of $\bar{X}$ (because $b[X] \geq 1$ for all $X$), every set considered has at most $b$ elements, and so the algorithm only looks at a polynomial number of sets of states $\bar{X}$.
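The inequality $b[\bar{X}_i] \leq b_i[\phi]$ can be checked mechanically on small formulas. The sketch below assumes the usual reading of DNF for these automata: a set of $n$-tuples of state sets, where conjunction takes componentwise unions and disjunction takes the union of alternatives. That reading, the tuple encoding of formulas, and the helper names `dnf`, `b_i`, `weight` and `check_bound` are assumptions of this sketch rather than definitions quoted from the paper.

```python
from itertools import product

def dnf(phi, n):
    """Set of n-tuples of frozensets of states representing phi."""
    tag = phi[0]
    if tag == "top":
        return {tuple(frozenset() for _ in range(n))}
    if tag == "bot":
        return set()
    if tag == "or":
        return dnf(phi[1], n) | dnf(phi[2], n)
    if tag == "and":
        # Componentwise union of every pair of alternatives.
        return {tuple(l | r for l, r in zip(left, right))
                for left, right in product(dnf(phi[1], n), dnf(phi[2], n))}
    if tag == "down":
        _, h, x = phi
        return {tuple(frozenset([x]) if i == h else frozenset()
                      for i in range(1, n + 1))}
    raise ValueError("unknown formula tag: %r" % (tag,))

def b_i(phi, i, b):
    """Traversal count of phi along direction i (0 for top and bottom)."""
    tag = phi[0]
    if tag in ("top", "bot"):
        return 0
    if tag == "and":
        return b_i(phi[1], i, b) + b_i(phi[2], i, b)
    if tag == "or":
        return max(b_i(phi[1], i, b), b_i(phi[2], i, b))
    _, h, x = phi  # remaining case: ("down", h, x)
    return b[x] if h == i else 0

def weight(xs, b):
    """b[X-bar]: sum of the traversal numbers of the states in xs."""
    return sum(b[x] for x in xs)

def check_bound(phi, n, b):
    """True if every DNF component satisfies b[X-bar_i] <= b_i[phi]."""
    return all(weight(t[i - 1], b) <= b_i(phi, i, b)
               for t in dnf(phi, n) for i in range(1, n + 1))
```

The inequality can be strict: in $(\downarrow_1 Y \vee \downarrow_2 Z) \wedge \downarrow_1 Y$ the two occurrences of $Y$ collapse into one element of the first component, while $b_1$ counts both.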

To conclude this section, we observe that the intersection of a $b$-bounded traversing ata and a $b'$-bounded traversing ata is a $(b + b')$-bounded traversing ata, and that a non-deterministic tree automaton is isomorphic to a 1-bounded traversing ata. This is useful to typecheck a $b$-bounded copying mtt, because we need to compute the intersection of the inferred ata, which is $b$-bounded traversing, and of the input type, which is given by a non-deterministic tree automaton. As a result, we obtain a $(b + 1)$-bounded traversing ata.